In recent years, standardized testing has faced increasing scrutiny, particularly following the COVID-19 pandemic, with many universities adopting test-optional admissions policies. The debate around standardized testing, particularly regarding SAT scores, is rooted in the question of unequal access to opportunities and resources.
Numerous studies have been conducted to investigate the relationship between school demographics and academic outcomes. For example, Rocco d'Este and Elias Einiö's paper, "Asian Segregation and Scholastic Achievement: Evidence from Primary Schools in New York City," found that an increase in the share of Asian students in a school correlated with a decrease in SAT scores for students of other races. Also, Atila Abdulkadiroglu, Weiwei Hu, and Parag Pathak's paper, "Small High Schools and Student Achievement: Lottery-Based Evidence from New York City," revealed that smaller class sizes were often associated with higher acceptance rates, as students received more personalized attention.
Our paper seeks to explore the relationship between school demographics and average SAT scores in New York City public schools, drawing on a range of predictors such as income, location (borough and zip code data), population size, and students' racial backgrounds. We wish to unify the different studies that each look at individual predictors of scores into one comprehensive study with clear visualisations and support. The focus on New York City is motivated by the city's impressive diversity of cultures, religions, ethnicities, and income brackets, which ensures the data is neither homogeneous nor repetitive. Using data analysis techniques such as summary statistics, visual graphs, mapping, and regression, we aim to identify factors that impact SAT scores and provide insights into potential interventions and improvements that could be made to the education system, particularly in New York City.
In the following sections, we will present a detailed analysis of our data, before discussing the implications of our findings and suggesting potential areas for further research.
Y variable: educational outcomes (SAT scores)
X variable: share of minorities (race), location (borough, zip codes)
The Y variable in our study is the average SAT score, a measure of student academic performance at a given school. The SAT scores in this dataset are marked out of 2400 (800 points per subject); the test has since been rescaled to a 1600-point maximum. They serve as a benchmark for educational quality and play a crucial role in determining college prospects.
The X variable is the demographic makeup of the school, including the share of minorities, location (borough, zip codes), and other factors. These variables were selected as they can impact the educational environment and affect student performance.
For instance, studying the different boroughs can provide information on the socioeconomic status of the school's area and the share of minorities can give insight into the diversity of the student body and cultural background, which can impact the learning environment. Neighborhood resources and aid also play a role in student scores.
To further analyze the relationship between school demographics and academic performance, we merged additional datasets to obtain more X variables. These include information on unsafe neighborhoods, population, and income. Unsafe neighborhoods can affect the school's environment and student performance, while population and income can give insights into the school's community and resources available to students.
The X variables are important for the analysis as they provide insight into the relationship between school demographics and academic performance. By understanding these factors, we can improve educational outcomes for students in NYC public schools.
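As a sketch of how those additional X variables can be attached to the school data, the snippet below left-merges a toy income table onto a toy school table on a shared Zip Code key. The frames and values here are illustrative stand-ins, not the project's real files:

```python
import pandas as pd

# Toy stand-ins for the real datasets (illustrative values only)
schools = pd.DataFrame({"Zip Code": [10002, 10467],
                        "Average Total SAT Score": [1859.0, 1100.0]})
income = pd.DataFrame({"Zip Code": [10002, 10467],
                       "Median Income": [35000, 41000]})

# Left-merge keeps every school row and attaches the matching income record
merged = schools.merge(income, on="Zip Code", how="left")
print(merged.shape)  # (2, 3)
```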
#Import packages
!pip install matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns
import geopandas as gpd
from shapely.geometry import Point
import json
%matplotlib inline
# Read the data in Python
data = pd.read_csv('/home/jovyan/Project1/Data/sat_scores.csv')
# Clean the data: drop missing values and reset the index
data.dropna(inplace = True)
data.reset_index(drop = True, inplace = True)
# Create a dataframe for the data
df = pd.DataFrame(data)
# Eliminate percentage signs (%)
for col in ["Percent White", "Percent Black", "Percent Hispanic", "Percent Asian"]:
    df[col] = pd.to_numeric(df[col].str.replace("%", ""))
#Create a new column to see the average overall SAT score
total_sat = data["Average Score (SAT Math)"] + data["Average Score (SAT Reading)"] + data["Average Score (SAT Writing)"]
total_sat
df['Average Total SAT Score'] = total_sat
df.head()
|   | School ID | School Name | Borough | Building Code | Street Address | City | State | Zip Code | Latitude | Longitude | ... | Student Enrollment | Percent White | Percent Black | Percent Hispanic | Percent Asian | Average Score (SAT Math) | Average Score (SAT Reading) | Average Score (SAT Writing) | Percent Tested | Average Total SAT Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 01M539 | New Explorations into Science, Technology and ... | Manhattan | M022 | 111 Columbia Street | Manhattan | NY | 10002 | 40.71873 | -73.97943 | ... | 1735.0 | 28.6 | 13.3 | 18.0 | 38.5 | 657.0 | 601.0 | 601.0 | 91.00% | 1859.0 |
| 3 | 02M294 | Essex Street Academy | Manhattan | M445 | 350 Grand Street | Manhattan | NY | 10002 | 40.71687 | -73.98953 | ... | 358.0 | 11.7 | 38.5 | 41.3 | 5.9 | 395.0 | 411.0 | 387.0 | 78.90% | 1193.0 |
| 4 | 02M308 | Lower Manhattan Arts Academy | Manhattan | M445 | 350 Grand Street | Manhattan | NY | 10002 | 40.71687 | -73.98953 | ... | 383.0 | 3.1 | 28.2 | 56.9 | 8.6 | 418.0 | 428.0 | 415.0 | 65.10% | 1261.0 |
| 5 | 02M545 | High School for Dual Language and Asian Studies | Manhattan | M445 | 350 Grand Street | Manhattan | NY | 10002 | 40.71687 | -73.98953 | ... | 416.0 | 1.7 | 3.1 | 5.5 | 88.9 | 613.0 | 453.0 | 463.0 | 95.90% | 1529.0 |
| 6 | 01M292 | Henry Street School for International Studies | Manhattan | M056 | 220 Henry Street | Manhattan | NY | 10002 | 40.71376 | -73.98526 | ... | 255.0 | 3.9 | 24.4 | 56.6 | 13.2 | 410.0 | 406.0 | 381.0 | 59.70% | 1197.0 |
5 rows × 23 columns
# Get relevant columns
summary = df[['Percent White', 'Percent Black', 'Percent Hispanic', 'Percent Asian', 'Average Score (SAT Math)',
'Average Score (SAT Reading)', 'Average Score (SAT Writing)', 'Average Total SAT Score']]
# Calculate summary statistics
summary_stats = summary.describe()
# Style
styled_table = summary_stats.style\
.set_properties(**{'border': '1px solid black', 'text-align': 'center'})
styled_table
|   | Percent White | Percent Black | Percent Hispanic | Percent Asian | Average Score (SAT Math) | Average Score (SAT Reading) | Average Score (SAT Writing) | Average Total SAT Score |
|---|---|---|---|---|---|---|---|---|
| count | 374.000000 | 374.000000 | 374.000000 | 374.000000 | 374.000000 | 374.000000 | 374.000000 | 374.000000 |
| mean | 8.524599 | 35.387166 | 43.929679 | 10.412567 | 432.719251 | 424.342246 | 418.286096 | 1275.347594 |
| std | 13.359205 | 25.367159 | 24.495584 | 14.400556 | 71.916833 | 61.884529 | 64.548388 | 194.866056 |
| min | 0.000000 | 0.000000 | 2.600000 | 0.000000 | 317.000000 | 302.000000 | 284.000000 | 924.000000 |
| 25% | 1.300000 | 16.400000 | 20.825000 | 1.600000 | 386.000000 | 386.000000 | 382.000000 | 1157.000000 |
| 50% | 2.600000 | 28.750000 | 45.300000 | 4.200000 | 414.000000 | 412.500000 | 402.500000 | 1226.000000 |
| 75% | 9.375000 | 50.100000 | 63.375000 | 11.150000 | 457.250000 | 444.500000 | 436.000000 | 1327.000000 |
| max | 79.900000 | 91.200000 | 100.000000 | 88.900000 | 754.000000 | 697.000000 | 693.000000 | 2144.000000 |
This table summarizes the racial composition and SAT scores across our sample of NYC schools.
Turning to the SAT scores, the average score per subject ranges between 418 and 432 points. Although the scores are fairly similar, there is a slight trend towards higher math scores and lower English writing scores. Adding all three tested areas together, the average total SAT score has a mean of 1275 points.
Regarding the racial composition of the schools, our data set includes 374 observations, representing 374 schools in NYC. On average, the student population is primarily composed of Hispanic students (43.9%), followed by Black students (35.4%). The remainder is made up of Asian students (10.4%) and White students (8.5%).
The high representation of Hispanic students in these New York schools is noteworthy. We will further explore whether there is a correlation between the racial background of the students and their performance on the SAT.
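One simple way to quantify that exploration is a Pearson correlation between a group's share of the student body and total SAT scores. The sketch below uses a few made-up rows shaped like our dataframe, not the real data:

```python
import pandas as pd

# Illustrative rows mimicking the cleaned dataframe (values are made up)
df_demo = pd.DataFrame({
    "Percent Hispanic": [18.0, 41.3, 56.9, 5.5],
    "Average Total SAT Score": [1859.0, 1193.0, 1261.0, 1529.0],
})
# Pearson correlation between the share of Hispanic students and total score
r = df_demo["Percent Hispanic"].corr(df_demo["Average Total SAT Score"])
print(r < 0)  # True: a negative association in this toy sample
```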
#Create a histogram for the Y variable:
# Assign values to our variables
subjects = ['Math', 'English Reading', 'English Writing']
colors = sns.color_palette("mako", len(subjects))
# Create histogram
sat_scores = df[["Average Score (SAT Math)", "Average Score (SAT Reading)", "Average Score (SAT Writing)"]].copy()
plt.hist(sat_scores, color = colors, bins = 10)
# Assign information to the histogram (columns are ordered to match the legend labels)
plt.xlabel("SAT Scores", loc="right")
plt.ylabel("Number of Schools with Specific SAT Score", loc="top")
plt.title("Distribution of SAT Scores of Schools in New York", fontsize=16)
plt.legend(subjects, loc='upper right')
We will now examine the distribution of SAT scores among New York schools. The scores demonstrate a wide range, from roughly 300 to over 700 points per subject.
The majority of the data is centered around the 400-point mark, with approximately 160 to 175 schools averaging 400 points in mathematics, reading, or writing.
At the 500-point mark, which represents a higher level of achievement, schools performed better in mathematics than in English reading and writing. This supports the trend we observed in our summary statistics: roughly 50 schools averaged 500 points in mathematics, whereas about 32 reached the same score in the other two subjects.
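The bar heights read off the histogram can be reproduced numerically with `np.histogram`, which uses the same equal-width binning as `plt.hist`. The scores below are illustrative, not the real per-school values:

```python
import numpy as np

# Illustrative per-school math averages (made up)
scores = np.array([380, 395, 402, 410, 418, 455, 500, 612, 657, 754])
# Same binning plt.hist applies with bins=10: equal-width bins over the data range
counts, edges = np.histogram(scores, bins=10)
print(counts.sum())  # 10: every school falls in exactly one bin
```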
#Average race of students per school
mean = pd.DataFrame(df, columns= ['Percent White','Percent Black','Percent Hispanic', 'Percent Asian'])
mean1 = mean.mean()  # column-wise averages across all schools (iloc[1] would select a single school)
colors = plt.cm.Set2(np.linspace(0, 1, len(mean1)))
# Create barchart and assort by ascending order
mean_sorted= mean1.sort_values(ascending = False)
ax = mean_sorted.plot.barh(color=colors, alpha=0.5)
# Setting barchart specifics
ax.set_title("Average Race of Students Per School (in %)", fontsize=16)
ax.set_xlabel("Percentage")
ax.bar_label(ax.containers[0], label_type = "center")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
This bar chart depicts the average racial composition of the student body across NYC schools. We notice that the majority of students are either Hispanic or Black: on average, 35.4% of students are Black and 43.9% are Hispanic, while White and Asian students are relatively underrepresented in comparison. Creating a bar chart of the average racial makeup per school will help us notice disparities later on in the study.
# Create a barchart of the number of schools found in each borough
borough = df['Borough'].value_counts()
ax = borough.plot.barh(color=plt.cm.tab20c(np.arange(len(borough))), alpha=0.7)
# Set labels and extra
ax.bar_label(ax.containers[0], label_type = "center")
ax.set_title("Number of Schools in Each Borough", fontsize=16)
ax.set_xlabel("Number of Schools", loc = "right")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
The purpose of this barchart is to display the distribution of schools across the different boroughs. According to the data, Brooklyn boasts the highest number of schools with 109, closely followed by the Bronx with 98. Conversely, our data shows only 10 schools in Staten Island where students have taken the SAT. The Bronx is commonly associated with being the tougher part of NYC, which leads us to wonder whether there will be disparities in SAT scores between that area and the wealthier Manhattan and Staten Island. In our subsequent analysis, we will explore whether a school's location - as represented by its borough - has any correlation with its SAT scores.
#SAT Score by borough
#Select data to plot
df2 = df[['Borough', 'Average Total SAT Score']]
df3 = df2.groupby(['Borough']).mean().sort_values('Average Total SAT Score')
#Create bar chart
sns.set_style("white")
colors = sns.color_palette("Blues", len(df3))
ax = df3.plot.barh(color=colors, alpha=0.7)
#Set labels and remove automated legend
ax.bar_label(ax.containers[0], label_type = "center")
ax.set_title("Average Total SAT Score by Borough", fontsize=16)
ax.set_xlabel("Total SAT Score", loc = "right")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.get_legend().remove()
We present a visualization of the average SAT score per borough, sorted from highest to lowest. It is worth noting that Manhattan, known for having the highest real estate prices among the boroughs, has an average SAT score of 1240.13 points. In contrast, the Bronx and Brooklyn, which have the highest poverty rates at 24.4% and 17.8% respectively according to an article in 'The City' (1), seem to have the lowest average SAT scores at 1202.72 points and 1230.26 points respectively.
On the other hand, Staten Island and Queens have the lowest poverty rates at 10.6% and 10.3% respectively and the highest average SAT scores at 1439 points and 1343 points (1). This suggests a possible correlation between poverty rates and SAT scores in these boroughs.
Staten Island's SAT score is roughly 100 points above Queens and Manhattan, which have very similar scores, and about 200 points above Brooklyn and the Bronx.
Overall, this visualization highlights the differences in academic achievement across the boroughs, which may be influenced by a variety of factors such as socioeconomic status, school resources, and cultural values.
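The suggested link between poverty and scores can be checked directly by correlating the borough-level figures quoted above. This is only a sketch over a handful of points (Manhattan's poverty rate was not quoted in the article, so it is left missing):

```python
import pandas as pd

# Borough figures quoted in the text (poverty rates from 'The City' article);
# Manhattan's poverty rate was not quoted, so it is left missing
boroughs = pd.DataFrame({
    "Poverty Rate": [24.4, 17.8, None, 10.3, 10.6],
    "Average Total SAT Score": [1202.72, 1230.26, 1240.13, 1343.0, 1439.0],
}, index=["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"])

# Drop the missing row, then correlate poverty with scores
complete = boroughs.dropna()
r = complete["Poverty Rate"].corr(complete["Average Total SAT Score"])
print(r < 0)  # True: higher poverty tracks lower scores across these boroughs
```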
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12, 6))
# Create overlapping scatter plots with regression lines and labels
sns.set_style("white")
sns.regplot(x=df['Percent White'], y=total_sat, ci=None, color='purple', label='White Students')
sns.regplot(x=df['Percent Hispanic'], y=total_sat, ci=None, color='orange', label='Hispanic Students')
# Add legend
plt.legend(loc='upper right')
# Set labels
plt.title("SAT Scores by Race of Students in NYC Schools", fontsize=16)
plt.xlabel("Percentage of Students by Race")
plt.ylabel("Combined SAT Score (600-2400)")
# Remove spines
sns.despine()
plt.show()
The figure shows a correlation between SAT scores and the racial composition of schools. Schools with more white students have higher SAT scores, while schools with more Hispanic students have lower SAT scores - the purple and orange regression lines show these opposing trends. Despite a noticeable cluster of points close to the origin and some variance in the data, the correlation is evident.
Our analysis of the data brings to light a disheartening phenomenon. Most schools that predominantly comprise Hispanic students perform worse than schools with a more modest representation of white students. This trend may be attributed to the fact that schools with a higher proportion of minorities generally receive less funding. The disparity observed here underscores the potential unfairness of standardized testing as an evaluation tool, given that schools with limited resources often produce less favorable outcomes than those that are more well-equipped.
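The slope behind the upward regression line can be estimated with `np.polyfit`, which fits the same least-squares line that `sns.regplot` draws. The rows below are made up for illustration, not the real data:

```python
import numpy as np

# Made-up school rows: percent white students vs total SAT score
pct_white = np.array([28.6, 11.7, 3.1, 1.7, 3.9, 45.0])
total_sat = np.array([1859.0, 1193.0, 1261.0, 1529.0, 1197.0, 1700.0])

# Degree-1 polyfit gives the same least-squares line sns.regplot draws
slope, intercept = np.polyfit(pct_white, total_sat, 1)
print(slope > 0)  # True: a positive slope in this toy sample
```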
The main message I wish to convey is that there does seem to be a relationship between SAT scores and the borough in which a school is located. This disparity is much more visible than when comparing SAT scores with race, based on what we have already analyzed. I want to plot a histogram or barchart that groups all the boroughs together, which, though possibly busy, will show clear trends. As always, the Y variable is SAT scores, and the X variable will be boroughs / neighbourhoods. I want the colors to be clearly discernible from each other, avoiding shades and hues of the same color.
import seaborn as sns
import matplotlib.pyplot as plt
# Select relevant columns
sat_borough = df[["Borough", "Average Total SAT Score"]]
# Group data by borough
sat_borough_grouped = sat_borough.groupby("Borough")
# Create kernel density plot for each borough
for borough, group in sat_borough_grouped:
    sns.kdeplot(group["Average Total SAT Score"], label=borough, fill = True)
plt.legend()
plt.xlabel("Average Total SAT Score")
plt.ylabel("Density")
plt.title("Distribution of SAT Scores by Borough", fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
#plt.grid(axis='y', alpha=0.75)
plt.show()
The visualization we have created provides us with valuable insights into the distribution of SAT scores across different boroughs in New York City. One key observation is the trend towards an average SAT score of 1200. We can see that the majority of schools fall within the range of 1000-1400, with a peak frequency around the 1200 mark.
When we examine the distribution of schools by borough, we notice some interesting patterns. Schools located in Brooklyn and the Bronx have a very high frequency of SAT scores around the average mark of 1200-1300 points. However, these schools do not seem to reach scores above that range. On the other hand, schools located in Manhattan and Queens trend towards higher SAT scores, with some schools even achieving scores above 1500.
Another important finding from our previous analysis is that Staten Island schools tend to have SAT scores centered around the 1400 mark.
This reinforces the notion that there is a clear relationship between school location and SAT scores, as schools in certain areas tend to perform consistently better or worse than others.
import pandas as pd
import matplotlib.pyplot as plt
# Create a boxplot for the Average Total SAT Score by Borough
plt.boxplot([df[df['Borough'] == 'Bronx']['Average Total SAT Score'],
             df[df['Borough'] == 'Brooklyn']['Average Total SAT Score'],
             df[df['Borough'] == 'Manhattan']['Average Total SAT Score'],
             df[df['Borough'] == 'Queens']['Average Total SAT Score'],
             df[df['Borough'] == 'Staten Island']['Average Total SAT Score']],
            labels=['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'])
# Add labels and a title to the plot
plt.xlabel('Borough')
plt.ylabel('Average Total SAT Score')
plt.title('Average Total SAT Score by Borough in NYC')
# Show the plot
plt.show()
This boxplot visualisation confirms what we studied earlier: boroughs such as Manhattan, Queens, and Staten Island have higher median SAT scores than the other two boroughs. Could there be underlying influences behind this? We will explore that later on.
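The medians read off the boxplot come from a simple `groupby`/`median`. A minimal sketch with toy borough rows (not the real dataframe):

```python
import pandas as pd

# Toy version of the borough/score table (values illustrative)
df_demo = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Manhattan", "Manhattan", "Queens"],
    "Average Total SAT Score": [1100.0, 1200.0, 1250.0, 1500.0, 1400.0],
})
# Median per borough: the line plt.boxplot draws inside each box
medians = df_demo.groupby("Borough")["Average Total SAT Score"].median()
print(medians["Manhattan"] > medians["Bronx"])  # True
```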
!pip install folium
!pip install bokeh
import folium
import json
from branca.colormap import LinearColormap
from folium import Choropleth, Circle, Marker
from folium.plugins import HeatMap, MarkerCluster
from shapely.geometry import Polygon, LineString, Point
from branca.colormap import linear
# Load the NYC map data
area = json.load(open('/home/jovyan/Project1/Data/Borough Boundaries.geojson', 'r'))
# Create a map of NYC and plot just the zipcodes
nyc = folium.Map(location=[40.7128, -74.0060], tiles='cartodbpositron', zoom_start = 10)
folium.GeoJson(area).add_to(nyc)
# Load the dataset and calculate the average SAT score by coordinates
sat_scores_by_coordinates = df.groupby(["Longitude", "Latitude"])["Average Total SAT Score"].mean()
# Create a pivot table and extract the data for the heatmap
pivot_table = sat_scores_by_coordinates.reset_index().pivot_table(values="Average Total SAT Score", index=["Latitude", "Longitude"], aggfunc=["mean", "count"])
pivot_table.columns = ["avg_sat_score", "num_schools"]
heatmap_data = pivot_table.reset_index()[["Latitude", "Longitude", "avg_sat_score"]]
# Define the colormap to use
cmap = LinearColormap(colors=['green', 'yellow', 'red'], vmin=heatmap_data["avg_sat_score"].min(), vmax=heatmap_data["avg_sat_score"].max())
# Create the circle markers layer and add it to the map
for index, row in heatmap_data.iterrows():
    folium.CircleMarker((row["Latitude"], row["Longitude"]),
                        radius=3,
                        weight=1.2,
                        color=None,
                        fill_color=cmap(row["avg_sat_score"]),
                        fill_opacity=0.8).add_to(nyc)  # opacity must be in [0, 1]
# Create the colorbar legend
cmap.caption = "Average SAT Score"
cmap.add_to(nyc)
# Display the map
nyc
We have generated a map showing the distribution of SAT scores across NYC, with borough boundaries drawn for easy identification. This map visualizes the relationship between our Y variable, SAT scores, and our X variable, location (boroughs). The map is dominated by green dots, indicating an average score of around 1300. However, a closer look at boroughs such as Manhattan and Queens reveals a concentration of yellow, orange, and red dots, indicating higher average scores than in other regions. In fact, we observed 11 schools in Manhattan with SAT scores above 1500.
Conversely, the Bronx stands out for having the majority of schools with SAT scores between 1000-1300, with only 2 schools surpassing this range. These findings align with our previous analysis of the "Average Total SAT Score per Borough" barchart, indicating that schools in "poorer" boroughs tend to perform worse than those in more affluent areas.
Our heatmap also sheds light on the high scores reported in Staten Island. We note that only 10 schools are represented, but they have significantly higher scores, which explains the borough's overall high average score.
The visualization clearly highlights the considerable disparity in SAT scores across various boroughs, which we already touched upon with the "Median Household Income" map, indicating the need to address educational inequalities in these areas.
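Counts like the "11 Manhattan schools above 1500" can be reproduced with a boolean filter plus `value_counts`. A toy sketch, not the real dataframe:

```python
import pandas as pd

# Illustrative rows; the real check would filter the cleaned df
df_demo = pd.DataFrame({
    "Borough": ["Manhattan", "Manhattan", "Bronx", "Queens"],
    "Average Total SAT Score": [1859.0, 1529.0, 1197.0, 1520.0],
})
# Boolean filter for high scorers, then tally per borough
high = df_demo[df_demo["Average Total SAT Score"] > 1500]
counts = high["Borough"].value_counts()
print(counts["Manhattan"])  # 2 in this toy sample
```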
#This dataframe allows us to find the dominant ethnic group per school, which we will use in the following map
# Filter the DataFrame to only include the columns you want to calculate the mean for
cols_to_mean = ['Borough', 'Percent White', 'Percent Black', 'Percent Asian', 'Percent Hispanic']
data_filtered = df[cols_to_mean + ['Longitude', 'Latitude']]
# Group the data by latitude and longitude and calculate the mean for each group
grouped_data = data_filtered.groupby(['Longitude', 'Latitude'])
mean_data = grouped_data.mean(numeric_only=True)
# Alternatively, you can use the apply function to find the dominant race for each row
def find_dominant_race(row):
    race_percentages = row[['Percent White', 'Percent Black', 'Percent Asian', 'Percent Hispanic']]
    max_percentage = race_percentages.max()
    dominant_race = race_percentages[race_percentages == max_percentage].index[0].split()[-1]
    return dominant_race
mean_data['Dominant Race'] = mean_data.apply(find_dominant_race, axis=1)
mean_data.head()
| Longitude | Latitude | Percent White | Percent Black | Percent Asian | Percent Hispanic | Dominant Race |
|---|---|---|---|---|---|---|
| -74.19215 | 40.52823 | 79.9 | 1.8 | 5.10 | 11.8 | White |
| -74.15785 | 40.58202 | 58.6 | 11.3 | 7.25 | 19.4 | White |
| -74.14211 | 40.63384 | 21.7 | 27.9 | 7.00 | 42.8 | Hispanic |
| -74.12310 | 40.59865 | 47.7 | 10.9 | 13.40 | 27.0 | White |
| -74.11536 | 40.56791 | 52.2 | 1.0 | 41.10 | 5.2 | White |
import json
import folium
# Load area shapefile
area = json.load(open('/home/jovyan/Project1/Data/Borough Boundaries.geojson', 'r'))
# Create map of NYC and plot just the zipcodes
nyc = folium.Map(location= [40.7128, -74.0060], tiles='cartodbpositron', width='100%', height='100%')
# Add the shape of NYC to the map
folium.GeoJson(area, name='NYC').add_to(nyc)
# Define color map for race categories
race_colors = {'White': 'blue', 'Black': 'purple', 'Asian': 'green', 'Hispanic': 'orange'}
# Create a FeatureGroupSubGroup for each race category, kept in a dict for lookup
fg_groups = {race: folium.plugins.FeatureGroupSubGroup(nyc, race) for race in race_colors}
# Iterate through each row in the mean data and add a circle marker to the corresponding FeatureGroupSubGroup
for _, row in mean_data.iterrows():
    # Get the dominant race and corresponding color for this location
    dominant_race = row['Dominant Race']
    color = race_colors[dominant_race]
    # row.name is the (Longitude, Latitude) index tuple, so latitude comes second
    folium.CircleMarker(location=[row.name[1], row.name[0]], radius=2, color=color,
                        fill=True, fill_color=color).add_to(fg_groups[dominant_race])
# Add the FeatureGroupSubGroups to the map
for fg in fg_groups.values():
    fg.add_to(nyc)
# Add legend to the map
legend = folium.features.CustomIcon('/home/jovyan/Project1/Code/IMG_0065 3.jpg', icon_size=(100, 100))
folium.Marker(location=[45, -74.0060], icon=legend, popup='Legend').add_to(nyc)
folium.map.LayerControl('topleft', collapsed=False, prefix = '').add_to(nyc)
# Display the map
nyc
On this map, we plot the dominant ethnic group at each school's location as our X variable to better understand its relation to our Y variable, SAT scores. Using an interactive map, we can explore the distribution of the predominant ethnic group of each school across NYC. Staten Island stands out as the borough with the highest concentration of predominantly white schools, which may explain its higher SAT scores compared to other areas of the city.
However, the distribution of ethnic groups varies greatly across the other boroughs. In Manhattan and the Bronx, the majority of schools have a predominantly Hispanic student population, while Brooklyn has a higher concentration of predominantly black schools. Queens, on the other hand, has a more diverse mix of predominantly Black, Hispanic, and Asian student populations.
While it may be tempting to make comparisons between schools with a certain ethnic makeup and their SAT scores, it's important to note that individual student performance is influenced by a variety of factors, including socioeconomic status, access to resources, and educational opportunities. These factors can vary widely within schools, regardless of their predominant ethnic group.
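Keeping that caveat in mind, a descriptive comparison is still possible by averaging total SAT scores within each dominant group. A sketch on made-up rows, not the real `mean_data`:

```python
import pandas as pd

# Made-up rows: dominant group per school plus its total SAT score
df_demo = pd.DataFrame({
    "Dominant Race": ["White", "Hispanic", "Hispanic", "Asian"],
    "Average Total SAT Score": [1700.0, 1200.0, 1250.0, 1529.0],
})
# Average score within each dominant group
by_group = df_demo.groupby("Dominant Race")["Average Total SAT Score"].mean()
print(by_group["Hispanic"])  # 1225.0
```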
Main message: evaluate the relationship between demographics and test scores
Data to add: population per zipcode
Initially, I was intrigued by the idea of examining the relationship between gender and SAT scores in New York City schools. My plan was to identify the dominant gender in a school and analyze the data accordingly. However, during my research, I discovered that the city's gender ratio is heavily skewed towards women, with only 94 men for every 100 women. This realization made me aware that my analysis could produce misleading results, with women appearing as the dominant gender in most schools.
The website does not provide an API, which means that I need to use HTML web scraping techniques to extract the data.
It appears that the data on the website is updated annually since it presents 2023 data. Therefore, there is no need to update the data frequently in this case.
I tried many different ways to scrape the data but could not figure it out. On the website itself, when I inspected the HTML, it was clearly divided with "tr" and "td" tags. However, when using BeautifulSoup to show the HTML output in Jupyter, I could not find clearly defined table rows and table data cells. Instead, the data is given as {"zip":"11368","population":108661,"city":"Corona","county":"Queens"} — a list of dictionaries.
I could perhaps loop over the list to extract the data from each dictionary one by one. I tried, but could not quite make it work.
From the "inspect" element on the webpage I found the class "jsx-a3119e4553b2cac7". However, when I downloaded the HTML using BeautifulSoup it returned a list of dictionaries instead.
I decided to use the website that the instructor provided as it was easier to scrape.
import requests
from bs4 import BeautifulSoup
# Set the URL we want to scrape from
url = "https://www.newyork-demographics.com/zip_codes_by_population"
# Connect to URL
response = requests.get(url)
# Parse HTML and save to a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")
# Find the table containing the data
table = soup.find('table', class_='ranklist table')
# Extract the data from each row of the table
# (named `rows`, not `data`, to avoid clobbering the dataframe loaded earlier)
rows = table.find_all('tr')
demo = pd.DataFrame(columns = ['Zip Code', 'Population'])
for row in rows[1:]:
    cols = row.find_all('td')
    if len(cols) >= 3:
        zipcode = cols[1].text.strip()
        population = cols[2].text.strip()
        demo.loc[len(demo)] = [zipcode, population]
# Display the dataframe
demo.head()
|   | Zip Code | Population |
|---|---|---|
| 0 | 11368 | 116,469 |
| 1 | 11385 | 109,111 |
| 2 | 11208 | 107,724 |
| 3 | 11236 | 102,238 |
| 4 | 10467 | 102,209 |
Although the guidelines suggested that we should outer merge the data, I don't think it would be useful for my project. When I merged on zipcodes, the resulting dataset contained more zipcodes (2,034) than there are schools in NYC in our original dataset (374). So, I am going to use a default (inner) merge on Zip Code with no other constraints, keeping only zipcodes that appear in both datasets.
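The effect of a default inner merge can be sanity-checked on toy frames: zipcodes present only in the scraped table are dropped, so the merged row count matches the school table. Illustrative values only:

```python
import pandas as pd

# Toy frames: the school table vs the scraped zip populations
schools = pd.DataFrame({"Zip Code": [10002, 10467]})
demo = pd.DataFrame({"Zip Code": [10002, 10467, 11368],
                     "Population": [76807, 102209, 116469]})

# Default (inner) merge keeps only zipcodes present in both frames
merged = schools.merge(demo, on="Zip Code")
print(len(merged))  # 2: the extra zipcode 11368 is dropped
```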
pop = pd.to_numeric(demo['Population'].str.replace(',', ''))
demo['Population'] = pop
demo['Zip Code'] = demo['Zip Code'].astype('int64')
population_df = df.merge(demo, on='Zip Code')
population_df.head()
|   | School ID | School Name | Borough | Building Code | Street Address | City | State | Zip Code | Latitude | Longitude | ... | Percent White | Percent Black | Percent Hispanic | Percent Asian | Average Score (SAT Math) | Average Score (SAT Reading) | Average Score (SAT Writing) | Percent Tested | Average Total SAT Score | Population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01M539 | New Explorations into Science, Technology and ... | Manhattan | M022 | 111 Columbia Street | Manhattan | NY | 10002 | 40.71873 | -73.97943 | ... | 28.6 | 13.3 | 18.0 | 38.5 | 657.0 | 601.0 | 601.0 | 91.00% | 1859.0 | 76807 |
| 1 | 02M294 | Essex Street Academy | Manhattan | M445 | 350 Grand Street | Manhattan | NY | 10002 | 40.71687 | -73.98953 | ... | 11.7 | 38.5 | 41.3 | 5.9 | 395.0 | 411.0 | 387.0 | 78.90% | 1193.0 | 76807 |
| 2 | 02M308 | Lower Manhattan Arts Academy | Manhattan | M445 | 350 Grand Street | Manhattan | NY | 10002 | 40.71687 | -73.98953 | ... | 3.1 | 28.2 | 56.9 | 8.6 | 418.0 | 428.0 | 415.0 | 65.10% | 1261.0 | 76807 |
| 3 | 02M545 | High School for Dual Language and Asian Studies | Manhattan | M445 | 350 Grand Street | Manhattan | NY | 10002 | 40.71687 | -73.98953 | ... | 1.7 | 3.1 | 5.5 | 88.9 | 613.0 | 453.0 | 463.0 | 95.90% | 1529.0 | 76807 |
| 4 | 01M292 | Henry Street School for International Studies | Manhattan | M056 | 220 Henry Street | Manhattan | NY | 10002 | 40.71376 | -73.98526 | ... | 3.9 | 24.4 | 56.6 | 13.2 | 410.0 | 406.0 | 381.0 | 59.70% | 1197.0 | 76807 |
5 rows × 24 columns
#Create a choropleth map of NYC population per zipcode
# Load NYC zipcodes geojson file
nyc_zipcodes = json.load(open('/home/jovyan/Project1/Data/map.geojson', 'r'))
# Create map of NYC and plot just the zipcodes
nyc_map = folium.Map(location=[40.7128, -74.0060], tiles='cartodbpositron', width='100%', height='100%')
# Add the shape of NYC to the map
folium.GeoJson(nyc_zipcodes).add_to(nyc_map)
# Create a choropleth layer based on population per zipcode
folium.Choropleth(
geo_data=nyc_zipcodes,
name='choropleth',
data=demo,
columns=['Zip Code', 'Population'],
key_on='feature.properties.postalCode',
fill_color='Blues',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Population',
).add_to(nyc_map)
# Add layer control to toggle between zipcodes and choropleth
folium.LayerControl().add_to(nyc_map)
# Display map
nyc_map
The population choropleth provides insightful information about the population distribution in New York. The visualization clearly shows that boroughs such as the Bronx and Brooklyn have larger populations than the other boroughs, indicated by the darker blue areas. In contrast, eastern Queens and southern Manhattan appear to be much less populated than the rest of New York.
This observation is interesting, as we had previously hypothesized that a higher population leads to higher SAT scores due to the availability of more resources. However, the data shows that the boroughs with the highest average SAT scores, such as Staten Island, Manhattan, and Queens, are actually among the less densely populated areas. This finding contradicts our original hypothesis and prompts us to reconsider our assumptions.
import matplotlib.pyplot as plt
import pandas as pd
# Create a boxplot of population by borough
plt.boxplot([population_df[population_df['Borough'] == 'Bronx']['Population'],
population_df[population_df['Borough'] == 'Brooklyn']['Population'],
population_df[population_df['Borough'] == 'Manhattan']['Population'],
population_df[population_df['Borough'] == 'Queens']['Population'],
population_df[population_df['Borough'] == 'Staten Island']['Population']],
labels=['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'])
# Add labels and a title to the plot
plt.xlabel('Borough')
plt.ylabel('Population')
plt.title('Population by Borough in NYC')
# Show the plot
plt.show()
This boxplot acts as an extension of the map to visualize population size more clearly. It shows that Brooklyn and the Bronx have a higher median population than the other boroughs.
For this part of the project, I attempted to find a webpage for web scraping that did not deny me access, but most relevant websites either disallowed scraping or already offered their data in CSV format. I found an article listing roughly 10 to 20 NYC neighborhoods viewed as unsafe, but it mentioned only the neighborhood names, not their zip codes, and I could not find a website mapping those exact neighborhood names to zip codes. From Wikipedia, I discovered that NYC is divided into Community Boards, and the Community Board pages include both zip code and neighborhood name data. I scraped that information from Wikipedia and merged the zip code and neighborhood data on "Community Board," then merged the result with the dataset scraped from the unsafe-neighborhoods article on "Neighbourhood." In the end I obtained the required dataframe. I recognize from the Project 3 comments that this workaround was not strictly necessary, but I am including it because it ties together the final visualisations of this part.
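The two-step linkage described above — community boards joining zip codes to neighbourhood names, which then join to the unsafe-neighbourhoods list — can be sketched with toy frames (the values below are illustrative stand-ins, not the scraped data):

```python
import pandas as pd

# Toy stand-ins for the three scraped tables (illustrative values only)
boards = pd.DataFrame({
    "Community Board": ["Bronx 1", "Bronx 2"],
    "Neighbourhood": ["Mott Haven", "Hunts Point"],
})
zips = pd.DataFrame({
    "Community Board": ["Bronx 1", "Bronx 2"],
    "Zipcodes": ["10451, 10454", "10455, 10474"],
})
danger = pd.DataFrame({"Neighbourhood": ["Hunts Point"]})

# Step 1: attach zip codes to neighbourhood names via the shared "Community Board" key
linked = pd.merge(boards, zips, on="Community Board")
# Step 2: keep only neighbourhoods flagged as unsafe via the shared "Neighbourhood" key
unsafe = pd.merge(linked, danger, on="Neighbourhood")
print(unsafe[["Neighbourhood", "Zipcodes"]])
```

The default inner merge drops any neighbourhood that never appears in the unsafe list, which is exactly the filtering the full scraping code performs below.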
#Most unsafe neighbourhoods in New York City
#Problem: need zipcode data
import requests
from bs4 import BeautifulSoup
# Make a GET request to the website
url = "https://usaestaonline.com/most-dangerous-neighborhoods-in-new-york-city"
response = requests.get(url)
# Parse the HTML content using Beautiful Soup
soup = BeautifulSoup(response.content, "html.parser")
# Find the post_content div
post_content = soup.find("div", class_="post_content")
# Find all the h3 tags in the post_content div
h3_tags = post_content.find_all("h3")
# Create an empty list to store the neighborhoods
neighborhoods = []
# Loop through the h3 tags to extract the rank and name of each neighborhood
for h3 in h3_tags:
    rank = h3.text.split(".")[0]  # Extract the rank
    name_split = h3.text.split(".")
    if len(name_split) >= 2:
        name = name_split[1].strip()  # Extract the name
        neighborhoods.append({"Rank": rank, "Neighbourhood": name})
# Convert the neighborhoods list into a pandas dataframe
danger = pd.DataFrame(neighborhoods)
danger.head()
| | Rank | Neighbourhood |
|---|---|---|
| 0 | 1 | Brownsville |
| 1 | 2 | Midtown |
| 2 | 3 | Bedford |
| 3 | 4 | Hunts Point |
| 4 | 5 | Mott Haven |
#Scrape zipcode data by getting NYC Community Board data
# Make a GET request to the Wikipedia page
url = "https://en.wikipedia.org/wiki/Neighborhoods_in_New_York_City"
response = requests.get(url)
# Parse the HTML content using Beautiful Soup
soup = BeautifulSoup(response.content, "html.parser")
# Find the first table in the page
table = soup.find_all("table")[0]
# Find all the rows in the table
rows = table.find_all("tr")
# Create an empty list to store the neighborhoods and community boards
neighbourhoods = []
# Loop through the rows to extract the data
for row in rows:
    # Find all the cells in the row
    cells = row.find_all("td")
    # Check if the row contains data and not just headers
    if len(cells) > 0:
        # Extract the data from the cells
        community_board = cells[0].text.strip()
        neighbourhood = cells[4].text.strip()
        # Add the data to the neighborhoods list
        neighbourhoods.append({"Community Board": community_board, "Neighbourhood": neighbourhood})
# Convert the neighborhoods list into a pandas dataframe
neighbourhood_df = pd.DataFrame(neighbourhoods)
# Convert the "Community Board" column to strings
neighbourhood_df["Community Board"] = neighbourhood_df["Community Board"].astype(str)
# Print the dataframe
neighbourhood_df.head()
| | Community Board | Neighbourhood |
|---|---|---|
| 0 | Bronx CB 1 | Melrose, Mott Haven, Port Morris |
| 1 | Bronx CB 2 | Hunts Point, Longwood |
| 2 | Bronx CB 3 | Claremont, Concourse Village, Crotona Park, Mo... |
| 3 | Bronx CB 4 | Concourse, Highbridge, Mount Eden |
| 4 | Bronx CB 5 | Fordham, Morris Heights, Mount Hope, Universit... |
#Get zipcodes for each Community Board per borough
def cb_zipcodes_brooklyn(start=1, end=18):
    dfs = []
    for cb in range(start, end + 1):
        # Generate the URL and community board name
        url = f"https://en.wikipedia.org/wiki/Brooklyn_Community_Board_{cb}"
        cb_name = f"Brooklyn CB {cb}"
        # Make a GET request to the website
        response = requests.get(url)
        # Parse the HTML content using Beautiful Soup
        soup = BeautifulSoup(response.content, "html.parser")
        # Find the div containing the ZIP codes
        zip_div = soup.find("div", {"class": "postal-code"})
        # Extract the ZIP codes from the div and concatenate them into a single row
        zip_codes = ", ".join(zip_div.text.strip().replace(" and ", ", ").split(", "))
        # Add a row with the name of the community board and its ZIP codes to the data frame
        dfs.append(pd.DataFrame({"Community Board": [cb_name], "ZIP code": [zip_codes]}))
    # Concatenate the data frames for each community board into a single data frame
    df = pd.concat(dfs, ignore_index=True)
    return df
brooklyn = cb_zipcodes_brooklyn(1, 18)
brooklyn.head()
| | Community Board | ZIP code |
|---|---|---|
| 0 | Brooklyn CB 1 | 11206, 11211, 11222 |
| 1 | Brooklyn CB 2 | 11201, 11205, 11217, 11238,, 11251 |
| 2 | Brooklyn CB 3 | 11205, 11206, 11216, 11221, 11233,, 11238 |
| 3 | Brooklyn CB 4 | 11206, 11207, 11221,, 11237 |
| 4 | Brooklyn CB 5 | 11207, 11208,, 11239 |
def cb_zipcodes_bronx(start=1, end=18):
    dfs = []
    for cb in range(start, end + 1):
        # Generate the URL and community board name
        url = f"https://en.wikipedia.org/wiki/Bronx_Community_Board_{cb}"
        cb_name = f"Bronx CB {cb}"
        # Make a GET request to the website
        response = requests.get(url)
        # Parse the HTML content using Beautiful Soup
        soup = BeautifulSoup(response.content, "html.parser")
        # Find the div containing the ZIP codes
        zip_div = soup.find("div", {"class": "postal-code"})
        # Check if zip_div is None
        if zip_div is None:
            print(f"No ZIP codes found for {cb_name}")
            continue
        # Extract the ZIP codes from the div and concatenate them into a single row
        zip_codes = ", ".join(zip_div.text.strip().replace(" and ", ", ").split(","))
        # Add a row with the name of the community board and its ZIP codes to the data frame
        dfs.append(pd.DataFrame({"Community Board": [cb_name], "ZIP code": [zip_codes]}))
    # Concatenate the data frames for each community board into a single data frame
    df = pd.concat(dfs, ignore_index=True)
    return df
bronx = cb_zipcodes_bronx(start=1, end=12)
bronx.head()
No ZIP codes found for Bronx CB 8
| | Community Board | ZIP code |
|---|---|---|
| 0 | Bronx CB 1 | 10451, 10454, 10455, , 10456 |
| 1 | Bronx CB 2 | 10455, 10459, , 10474 |
| 2 | Bronx CB 3 | 10456, 10459, , 10460 |
| 3 | Bronx CB 4 | 10451, 10452, , 10456 |
| 4 | Bronx CB 5 | 10452, 10453, 10457, 10458, , 10468 |
def cb_zipcodes_manhattan(start=1, end=18):
    dfs = []
    for cb in range(start, end + 1):
        # Generate the URL and community board name
        url = f"https://en.wikipedia.org/wiki/Manhattan_Community_Board_{cb}"
        cb_name = f"Manhattan CB {cb}"
        # Make a GET request to the website
        response = requests.get(url)
        # Parse the HTML content using Beautiful Soup
        soup = BeautifulSoup(response.content, "html.parser")
        # Find the div containing the ZIP codes
        zip_div = soup.find("div", {"class": "postal-code"})
        # Check if zip_div is None
        if zip_div is None:
            print(f"No ZIP codes found for {cb_name}")
            continue
        # Extract the ZIP codes from the div and concatenate them into a single row
        zip_codes = ", ".join(zip_div.text.strip().replace(" and ", ", ").split(","))
        # Add a row with the name of the community board and its ZIP codes to the data frame
        dfs.append(pd.DataFrame({"Community Board": [cb_name], "ZIP code": [zip_codes]}))
    # Concatenate the data frames for each community board into a single data frame
    df = pd.concat(dfs, ignore_index=True)
    return df
manhattan = cb_zipcodes_manhattan(start=1, end=12)
manhattan.head()
No ZIP codes found for Manhattan CB 1
No ZIP codes found for Manhattan CB 2
No ZIP codes found for Manhattan CB 10
| | Community Board | ZIP code |
|---|---|---|
| 0 | Manhattan CB 3 | 10002, 10003, 10007, 10009, 10013, 100038 |
| 1 | Manhattan CB 4 | 10001, 10011, 10018, 10019, 10036 |
| 2 | Manhattan CB 5 | 10003, 10010, 10011, 10016, 10017, 10018,... |
| 3 | Manhattan CB 6 | 10003, 10009, 10010, 10016, 10017, , 10022 |
| 4 | Manhattan CB 7 | 10023, 10024, 10025, 10069 |
def cb_zipcodes_queens(start=1, end=18):
    dfs = []
    for cb in range(start, end + 1):
        # Generate the URL and community board name
        url = f"https://en.wikipedia.org/wiki/Queens_Community_Board_{cb}"
        cb_name = f"Queens CB {cb}"
        # Make a GET request to the website
        response = requests.get(url)
        # Parse the HTML content using Beautiful Soup
        soup = BeautifulSoup(response.content, "html.parser")
        # Find the div containing the ZIP codes
        zip_div = soup.find("div", {"class": "postal-code"})
        # Check if zip_div is None
        if zip_div is None:
            print(f"No ZIP codes found for {cb_name}")
            continue
        # Extract the ZIP codes from the div and concatenate them into a single row
        zip_codes = ", ".join(zip_div.text.strip().replace(" and ", ", ").split(","))
        # Add a row with the name of the community board and its ZIP codes to the data frame
        dfs.append(pd.DataFrame({"Community Board": [cb_name], "ZIP code": [zip_codes]}))
    # Concatenate the data frames for each community board into a single data frame
    df = pd.concat(dfs, ignore_index=True)
    return df
queens = cb_zipcodes_queens(start=1, end=14)
queens.head()
No ZIP codes found for Queens CB 9
No ZIP codes found for Queens CB 10
No ZIP codes found for Queens CB 11
No ZIP codes found for Queens CB 13
| | Community Board | ZIP code |
|---|---|---|
| 0 | Queens CB 1 | 11101, 11102, 11103, 11105, 11106, , 11370 |
| 1 | Queens CB 2 | 11101, 11104, 11377, 11378 |
| 2 | Queens CB 3 | 11368, 11389, 11370, , 11372 |
| 3 | Queens CB 4 | 11368, 11373, , 11377 |
| 4 | Queens CB 5 | 11374, 11378, 11379, , 11385 |
def cb_zipcodes_staten(start=1, end=18):
    dfs = []
    for cb in range(start, end + 1):
        # Generate the URL and community board name
        url = f"https://en.wikipedia.org/wiki/Staten_Island_Community_Board_{cb}"
        cb_name = f"Staten Island CB {cb}"
        # Make a GET request to the website
        response = requests.get(url)
        # Parse the HTML content using Beautiful Soup
        soup = BeautifulSoup(response.content, "html.parser")
        # Find the div containing the ZIP codes
        zip_div = soup.find("div", {"class": "postal-code"})
        # Check if zip_div is None
        if zip_div is None:
            print(f"No ZIP codes found for {cb_name}")
            continue
        # Extract the ZIP codes from the div and concatenate them into a single row
        zip_codes = ", ".join(zip_div.text.strip().replace(" and ", ", ").split(","))
        # Add a row with the name of the community board and its ZIP codes to the data frame
        dfs.append(pd.DataFrame({"Community Board": [cb_name], "ZIP code": [zip_codes]}))
    # Concatenate the data frames for each community board into a single data frame
    df = pd.concat(dfs, ignore_index=True)
    return df
staten = cb_zipcodes_staten(start=1, end=3)
staten
| | Community Board | ZIP code |
|---|---|---|
| 0 | Staten Island CB 1 | 10301, 10302, 10303, 10304, 10305, 10310,... |
| 1 | Staten Island CB 2 | 10301, 10304, 10305, 10306, , 10314 |
| 2 | Staten Island CB 3 | 10306, 10307, 10308, 10309, 10312 |
#Combine the zipcode-per-borough data into a singular dataframe
# Stack the five borough data frames using pd.concat()
zipcode = pd.concat([bronx, brooklyn, manhattan, queens, staten], axis=0)
# Rename the columns
zipcode.columns = ["Community Board", "Zipcodes"]
# Convert the values in the "Community Board" column to strings
zipcode["Community Board"] = zipcode["Community Board"].astype(str)
# Sort the data frame by Community Boards
zipcode = zipcode.sort_values(by=["Community Board"], ignore_index=True)
# Show the final merged data frame
zipcode.head()
| | Community Board | Zipcodes |
|---|---|---|
| 0 | Bronx CB 1 | 10451, 10454, 10455, , 10456 |
| 1 | Bronx CB 10 | 10461, 10465, 10467, , 10475 |
| 2 | Bronx CB 11 | 10460, 10461, 10462, 10467, , 10469 |
| 3 | Bronx CB 12 | 10460, 10466, 10467, 10469, 10470, , 10475 |
| 4 | Bronx CB 2 | 10455, 10459, , 10474 |
#Clean and merge all the datasets together
zipcode['Community Board'] = zipcode['Community Board'].str.replace('CB','').str.strip()
neighbourhood_df['Community Board'] = neighbourhood_df['Community Board'].str.replace('\xa0CB\xa0', '')
neighbourhood_df['Community Board'] = neighbourhood_df['Community Board'].str.replace('Bronx', 'Bronx ')
neighbourhood_df['Community Board'] = neighbourhood_df['Community Board'].str.replace('Brooklyn', 'Brooklyn ')
neighbourhood_df['Community Board'] = neighbourhood_df['Community Board'].str.replace('Manhattan', 'Manhattan ')
neighbourhood_df['Community Board'] = neighbourhood_df['Community Board'].str.replace('Queens', 'Queens ')
neighbourhood_df['Community Board'] = neighbourhood_df['Community Board'].str.replace('Staten Island', 'Staten Island ')
merged_df = pd.merge(neighbourhood_df, zipcode, on='Community Board', how='outer')
merged_df.head()
| | Community Board | Neighbourhood | Zipcodes |
|---|---|---|---|
| 0 | Bronx 1 | Melrose, Mott Haven, Port Morris | 10451, 10454, 10455, , 10456 |
| 1 | Bronx 2 | Hunts Point, Longwood | 10455, 10459, , 10474 |
| 2 | Bronx 3 | Claremont, Concourse Village, Crotona Park, Mo... | 10456, 10459, , 10460 |
| 3 | Bronx 4 | Concourse, Highbridge, Mount Eden | 10451, 10452, , 10456 |
| 4 | Bronx 5 | Fordham, Morris Heights, Mount Hope, Universit... | 10452, 10453, 10457, 10458, , 10468 |
# Create a dictionary to store the neighborhoods and their zipcodes
neighborhoods = {}
# Loop through each row in the merged_df dataset
for index, row in merged_df.iterrows():
    # Check if the value in the "Neighbourhood" column is a string
    if isinstance(row['Neighbourhood'], str):
        # Split the neighborhood names in the row and convert them into a list
        neighborhood_list = row['Neighbourhood'].split(', ')
        # Loop through each neighborhood in the list
        for neighborhood in neighborhood_list:
            # Check if the neighborhood is in the danger dataset
            if neighborhood in danger['Neighbourhood'].values:
                # Add the neighborhood and its associated zipcodes to the neighborhoods dictionary
                neighborhoods[neighborhood] = row['Zipcodes']
# Create a pandas dataframe from the neighborhoods dictionary
neighborhoods_df = pd.DataFrame({'Neighborhood': list(neighborhoods.keys()), 'Zip Code': list(neighborhoods.values())})
# Print out the dataframe
neighborhoods_df
| | Neighborhood | Zip Code |
|---|---|---|
| 0 | Mott Haven | 10451, 10454, 10455, , 10456 |
| 1 | Hunts Point | 10455, 10459, , 10474 |
| 2 | Fordham | 10452, 10453, 10457, 10458, , 10468 |
| 3 | Norwood | 10453, 10458, 10463, 10467, 10468 |
| 4 | Soundview | 10462, 10472, , 10473 |
| 5 | Fort Greene | 11201, 11205, 11217, 11238,, 11251 |
| 6 | Vinegar Hill | 11201, 11205, 11217, 11238,, 11251 |
| 7 | Ocean Hill | 11212, 11233 |
| 8 | Brownsville | 11212, 11233 |
| 9 | Midtown | 10003, 10010, 10011, 10016, 10017, 10018,... |
| 10 | East Harlem | 10029, 10035, , 10037 |
neighborhoods_df['Zip Code'] = neighborhoods_df['Zip Code'].astype(str)
new_rows = []
for _, row in neighborhoods_df.iterrows():
    neighborhood = row['Neighborhood']
    zip_codes = row['Zip Code'].split(',')
    for zip_code in zip_codes:
        zip_code = zip_code.strip()
        if zip_code:
            new_rows.append({'Unsafe': neighborhood, 'Zip Code': zip_code})
new_danger_df = pd.DataFrame(new_rows)
merged_df2 = pd.merge(neighborhoods_df, new_danger_df, on='Zip Code', how='right')
merged_df2.head()
| | Neighborhood | Zip Code | Unsafe |
|---|---|---|---|
| 0 | NaN | 10451 | Mott Haven |
| 1 | NaN | 10454 | Mott Haven |
| 2 | NaN | 10455 | Mott Haven |
| 3 | NaN | 10456 | Mott Haven |
| 4 | NaN | 10455 | Hunts Point |
merged_df2['Zip Code'] = merged_df2['Zip Code'].astype('int64')
unsafe_df = pd.merge(df, merged_df2, on='Zip Code', how = 'outer')
unsafe_df = unsafe_df.drop(['Neighborhood'], axis=1)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Drop rows with NaN values in the Average Total SAT Score or Unsafe columns
unsafe_df = unsafe_df.dropna(subset=['Average Total SAT Score', 'Unsafe'])
# Calculate the interquartile range for the Average Total SAT Score column
Q1 = unsafe_df['Average Total SAT Score'].quantile(0.25)
Q3 = unsafe_df['Average Total SAT Score'].quantile(0.75)
IQR = Q3 - Q1
# Filter out any values that are below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
unsafe_df = unsafe_df[(unsafe_df['Average Total SAT Score'] >= Q1 - 1.5*IQR) & (unsafe_df['Average Total SAT Score'] <= Q3 + 1.5*IQR)]
# Split the filtered dataset into safe and unsafe schools
safe_df = unsafe_df[unsafe_df['Unsafe'].isna()]
unsafe_df = unsafe_df[unsafe_df['Unsafe'].notna()]
# Set the font size for the axis labels and legend
sns.set(font_scale=0.8)
# Set the size of the figure
fig = plt.figure(figsize=(15, 6))
# Create a subplot for the box plot of Average Total SAT Score
ax1 = fig.add_subplot(121)
sns.boxplot(y='Average Total SAT Score', data=population_df, color='blue', orient='v', ax=ax1, showfliers=False)
ax1.set_ylabel('Average Total SAT Score')
ax1.set_title('Distribution of Average Total SAT Scores')
# Add a boxplot of the average total SAT score to the same subplot
sns.boxplot(y='Average Total SAT Score', data=safe_df, color='red', orient='v', ax=ax1)
# Create a subplot for the box plot of safe and unsafe schools
ax2 = fig.add_subplot(122)
sns.boxplot(x='Unsafe', y='Average Total SAT Score', data=pd.concat([safe_df, unsafe_df]), color='blue', ax=ax2)
ax2.set_xlabel('Unsafe Neighbourhoods')
ax2.set_ylabel('Average Total SAT Score')
ax2.set_title('Distribution of Average Total SAT Scores for Schools from Less Safe Neighbourhoods')
ax2.tick_params(axis='x', labelrotation=45)
ax2.set_ylim(ax1.get_ylim())
plt.show()
By analyzing the boxplots, we can observe that the median Average Total SAT Score of schools situated in less safe neighborhoods is lower than the overall median SAT score across all boroughs. While a few neighborhoods, such as Midtown and East Harlem, have higher medians, the majority have medians below the city average. This suggests that attending school in a disadvantaged neighborhood is associated with lower SAT scores relative to the city average.
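The median gap read off the boxplots can also be checked numerically. A minimal sketch with illustrative scores (the real comparison would use the `safe_df` and `unsafe_df` frames built above):

```python
import pandas as pd

# Illustrative scores only; the real data are the filtered safe/unsafe frames
scores = pd.DataFrame({
    "Average Total SAT Score": [1450, 1200, 1180, 1520, 1100, 1390],
    "Unsafe": [None, "Brownsville", "Hunts Point", None, "Mott Haven", None],
})

# Citywide median vs. the median for schools mapped to an unsafe neighbourhood
citywide_median = scores["Average Total SAT Score"].median()
unsafe_median = scores.loc[scores["Unsafe"].notna(), "Average Total SAT Score"].median()
print(citywide_median, unsafe_median)
```

On this toy data, as in the boxplots, the unsafe-neighbourhood median falls below the citywide median.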
# Load data
nyc_zipcodes = json.load(open('/home/jovyan/Project1/Data/map.geojson', 'r'))
# Encode safety as a numeric flag for the choropleth: 1 if the zip code maps to
# an unsafe neighbourhood, 0 otherwise (pd.to_numeric would coerce the
# neighbourhood names themselves to NaN and lose the distinction)
unsafe_df['Unsafe'] = unsafe_df['Unsafe'].notna().astype(int)
# Create map of NYC and plot just the zipcodes
nyc_map = folium.Map(location=[40.7128, -74.0060], tiles='cartodbpositron', width='100%', height='100%')
# Add the shape of NYC to the map
folium.GeoJson(nyc_zipcodes).add_to(nyc_map)
# Create choropleth map
choropleth = folium.Choropleth(
geo_data=nyc_zipcodes,
name='choropleth',
data=unsafe_df,
columns=['Zip Code', 'Unsafe'],
key_on='feature.properties.postalCode',
fill_color='YlOrRd',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Safety level',
highlight=True,
nan_fill_color='green',
nan_fill_opacity=0.5,
).add_to(nyc_map)
# add legend
choropleth.geojson.add_child(
folium.features.GeoJsonTooltip(['postalCode'], labels=False)
)
# display map
nyc_map
By using this map, we can gain further insight into the relationship between unsafe neighborhoods and lower SAT scores. The concentration of unsafe neighborhoods in the Bronx, Brooklyn, and Manhattan is particularly noteworthy. It's interesting to note that the Bronx, which has the lowest average SAT scores in our dataset, has a particularly high concentration of unsafe neighborhoods.
This observation suggests that the location of schools in less safe neighborhoods could be a contributing factor to lower SAT scores. These schools may face greater challenges in terms of resource access and the socio-economic background of the students, making it more difficult for them to achieve high scores on standardized tests like the SAT. This underscores the need to address the root causes of educational disparities in disadvantaged communities and to provide support for students and schools located in these neighborhoods.
Initially, I planned to use income data for Project 3. However, I encountered difficulties in finding viable websites to scrape the required information from, so I opted to download a pre-existing dataset with the relevant data. I believe that incorporating income data adds a valuable layer of analysis and insight to the final results.
income = pd.read_csv('/home/jovyan/Project1/Data/ACSST5Y2021.S1901-Data.csv')
income = income.rename(columns={'Zipcode': 'Zip Code'})
income['Zip Code'] = income['Zip Code'].astype('int64')
income.head()
| | Zip Code | Median Income |
|---|---|---|
| 0 | 6390 | 46250.0 |
| 1 | 10001 | 101409.0 |
| 2 | 10002 | 37093.0 |
| 3 | 10003 | 137533.0 |
| 4 | 10004 | 216017.0 |
final_df = pd.merge(income, population_df, on='Zip Code')
final_df.head()
| | Zip Code | Median Income | School ID | School Name | Borough | Building Code | Street Address | City | State | Latitude | ... | Percent Black | Percent Hispanic | Percent Asian | Average Score (SAT Math) | Average Score (SAT Reading) | Average Score (SAT Writing) | Percent Tested | Average Total SAT Score | Population | ln(Population) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10002 | 37093.0 | 01M539 | New Explorations into Science, Technology and ... | Manhattan | M022 | 111 Columbia Street | Manhattan | NY | 40.71873 | ... | 13.3 | 18.0 | 38.5 | 657.0 | 601.0 | 601.0 | 91.00% | 1859.0 | 76807 | 11.249051 |
| 1 | 10002 | 37093.0 | 02M294 | Essex Street Academy | Manhattan | M445 | 350 Grand Street | Manhattan | NY | 40.71687 | ... | 38.5 | 41.3 | 5.9 | 395.0 | 411.0 | 387.0 | 78.90% | 1193.0 | 76807 | 11.249051 |
| 2 | 10002 | 37093.0 | 02M308 | Lower Manhattan Arts Academy | Manhattan | M445 | 350 Grand Street | Manhattan | NY | 40.71687 | ... | 28.2 | 56.9 | 8.6 | 418.0 | 428.0 | 415.0 | 65.10% | 1261.0 | 76807 | 11.249051 |
| 3 | 10002 | 37093.0 | 02M545 | High School for Dual Language and Asian Studies | Manhattan | M445 | 350 Grand Street | Manhattan | NY | 40.71687 | ... | 3.1 | 5.5 | 88.9 | 613.0 | 453.0 | 463.0 | 95.90% | 1529.0 | 76807 | 11.249051 |
| 4 | 10002 | 37093.0 | 01M292 | Henry Street School for International Studies | Manhattan | M056 | 220 Henry Street | Manhattan | NY | 40.71376 | ... | 24.4 | 56.6 | 13.2 | 410.0 | 406.0 | 381.0 | 59.70% | 1197.0 | 76807 | 11.249051 |
5 rows × 26 columns
income['Zip Code'] = income['Zip Code'].astype('int64')
# Create a heat map of NYC median income per zip code
# Load NYC zipcodes geojson file
nyc_zipcodes = json.load(open('/home/jovyan/Project1/Data/map.geojson', 'r'))
# Create map of NYC and plot just the zipcodes
nyc_map = folium.Map(location=[40.7128, -74.0060], tiles='cartodbpositron', width='100%', height='100%')
# Add the shape of NYC to the map
folium.GeoJson(nyc_zipcodes).add_to(nyc_map)
# Create a choropleth layer based on median income per zipcode
folium.Choropleth(
geo_data=nyc_zipcodes,
name='choropleth',
data=income,
columns=['Zip Code', 'Median Income'],
key_on='feature.properties.postalCode',
fill_color='YlGnBu',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Median Income',
).add_to(nyc_map)
# Add layer control to toggle between zipcodes and choropleth
folium.LayerControl().add_to(nyc_map)
# Display map
nyc_map
By examining the income distribution across NYC using the map, we can see a clear disparity between Manhattan and the Bronx. The Bronx has a higher concentration of zip codes with lower median incomes, roughly between $2,500 and $43,750, while Manhattan has more zip codes with higher median incomes, roughly between $126,250 and $167,500.
Throughout our analysis, we have found that the Bronx has a higher population, lower income brackets, and lower SAT scores, with a population consisting mostly of Hispanic and Black individuals. On the other hand, Manhattan has a lower population, higher income brackets, a more diverse population, and higher SAT scores. Therefore, the income disparity between these two boroughs may be one of the contributing factors to the differences in academic performance between the schools in these areas.
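The borough-level contrasts summarized here can be produced with a single groupby aggregation. A minimal sketch with illustrative rows (on the real data this would run on `final_df`, which has one row per school):

```python
import pandas as pd

# Illustrative rows only; the real frame is final_df
schools = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Manhattan", "Manhattan"],
    "Median Income": [37000, 41000, 101000, 137000],
    "Average Total SAT Score": [1190, 1230, 1520, 1610],
})

# One row per borough: typical income and typical SAT score
summary = schools.groupby("Borough").agg(
    median_income=("Median Income", "median"),
    median_sat=("Average Total SAT Score", "median"),
)
print(summary)
```

A table like this makes the Bronx/Manhattan gap in both income and scores visible at a glance.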
import matplotlib.pyplot as plt
import pandas as pd
# Create a boxplot for median income by borough
plt.boxplot([final_df[final_df['Borough'] == 'Bronx']['Median Income'],
final_df[final_df['Borough'] == 'Brooklyn']['Median Income'],
final_df[final_df['Borough'] == 'Manhattan']['Median Income'],
final_df[final_df['Borough'] == 'Queens']['Median Income'],
final_df[final_df['Borough'] == 'Staten Island']['Median Income']],
labels=['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'])
# Add labels and a title to the plot
plt.xlabel('Borough')
plt.ylabel('Median Income ($)')
plt.title('Median Income by Borough in NYC')
# Show the plot
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Select the columns to be included in the correlation heatmap
cols = ['Average Total SAT Score', 'Percent White', 'Percent Asian', 'Percent Hispanic', 'Percent Black', 'Borough', 'Population', 'Median Income']
# Calculate the correlation matrix
corr = final_df[cols].corr(numeric_only=True)
# Create the correlation heatmap with a coolwarm color scheme
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title('Correlation Heatmap')
plt.show()
This last boxplot confirms what we observed on the map: Manhattan leads with the highest median income, followed by Staten Island, Queens, Brooklyn, and finally the Bronx.
This correlation heatmap provides a comprehensive overview of our findings. A school's racial composition shows a strong association with its average total SAT score: percent White and percent Asian are positively correlated with SAT scores, with a correlation coefficient of 0.62, while percent Hispanic and percent Black are negatively correlated, with coefficients of -0.41 and -0.3, respectively.
Furthermore, we observed a slight negative relationship between population and SAT scores, suggesting that schools in less populous zip codes tend to score somewhat higher. Additionally, we found a weak positive correlation between median income and SAT scores (0.27), indicating that median income is only a weak predictor of SAT scores in our dataset.
Overall, this heatmap highlights the strong association between race and ethnicity and student performance on standardized tests such as the SAT. It also supports our previous findings that population and median income have weaker but still notable relationships with SAT scores.
import numpy as np
import matplotlib.pyplot as plt
# Plot scatter plot
plt.scatter(final_df['Median Income'], final_df['Average Total SAT Score'])
plt.xlabel('Median Income')
plt.ylabel('Average Total SAT Scores')
plt.title('Scatter Plot of Median Income vs Average Total SAT Scores')
# Add trendline
z = np.polyfit(final_df['Median Income'], final_df['Average Total SAT Score'], 1)
p = np.poly1d(z)
plt.plot(final_df['Median Income'], p(final_df['Median Income']), "r--")
# Calculate the correlation coefficient between median income and SAT scores
# (both series come from final_df, so the rows are aligned)
correlation = final_df['Median Income'].corr(final_df['Average Total SAT Score'])
print('Correlation coefficient:', round(correlation, 2))
# Show plot
plt.show()
Correlation coefficient: 0.27
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Create a scatter plot with the population on the x-axis and the SAT score on the y-axis
plt.scatter(population_df['Population'], population_df['Average Total SAT Score'])
# Add labels and title
plt.xlabel('Population')
plt.ylabel('Average Total SAT Score')
plt.title('Relationship between population and SAT scores')
# Add trend line
z = np.polyfit(population_df['Population'], population_df['Average Total SAT Score'], 1)
p = np.poly1d(z)
plt.plot(population_df['Population'], p(population_df['Population']), "r--")
# Show the plot
plt.show()
# Calculate the correlation coefficient between population and SAT scores
correlation = population_df['Population'].corr(population_df['Average Total SAT Score'])
print('Correlation coefficient:', correlation)
Correlation coefficient: -0.09021330209020872
To gain insight into the possible factors influencing SAT scores, we have selected four X variables to focus on in our analysis: borough, racial demographics, population, and median income. Each of these variables could provide valuable information about the drivers of student performance.
Our analysis builds on previous research examining the relationship between educational achievement and various factors. For example, the study "Asian Segregation and Scholastic Achievement: Evidence from Primary Schools in New York City" by Rocco d'Este and Elias Einiö examined how the share of Asian students in a school relates to the achievement of their peers. Our analysis will expand on this by examining the relationships across all major racial demographics in NYC public schools.
The relationship between borough location, infrastructure, and income could be linked to educational outcomes. Places with more resources and higher median incomes may see better SAT scores. A paper titled "Small High Schools and Student Achievement: Lottery-Based Evidence from New York City" by Atila Abdulkadiroglu, Weiwei Hu, and P. Pathak suggests that smaller schools and class sizes lead to more college acceptances.
Our analysis suggests a possible linear relationship between SAT scores and these X variables. The regression graphs above reveal a linear trend between SAT scores and median income, as well as between SAT scores and population. However, the relationship with median income is somewhat heteroskedastic, so we take the log transformation of that X variable to mitigate this effect. Population also shows considerable scatter around the OLS line, so we may consider logging that variable as well.
While we acknowledge that potential non-linearities and endogeneity issues could complicate our analysis, the theory supports a linear relationship between SAT scores and our X variables.
!pip install stargazer
!pip install linearmodels
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col
from linearmodels.iv import IV2SLS
from stargazer.stargazer import Stargazer
from IPython.core.display import HTML
sat_income = pd.merge(df, income, on='Zip Code', how='inner')
sat_income.dropna(inplace=True)
import numpy as np
import statsmodels.api as sm
from stargazer.stargazer import Stargazer
from IPython.core.display import HTML
# Take the natural logarithm of Median Income
sat_income['ln(Median Income)'] = np.log(sat_income['Median Income'])
# Add a constant
X = sm.add_constant(sat_income[['ln(Median Income)']])
# Create the regression model with ln(Median Income) and constant
reg1 = sm.OLS(endog=sat_income['Average Total SAT Score'], exog=X,
missing='drop')
results0 = reg1.fit()
# Print the regression results
stargazer = Stargazer([results0])
stargazer.custom_columns(["Average Total SAT Scores"],[1])
HTML(stargazer.render_html())
| Dependent variable:Average Total SAT Score | |
| Average Total SAT Scores | |
| (1) | |
| const | 0.076 |
| (0.172) | |
| ln(Median Income) | 106.174*** |
| (19.751) | |
| Observations | 374 |
| R2 | 0.072 |
| Adjusted R2 | 0.070 |
| Residual Std. Error | 187.964 (df=372) |
| F Statistic | 28.897*** (df=1; 372) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
This regression examines the relationship between average SAT scores as the dependent variable and the log of median income as the independent variable. The table suggests that a 10% increase in median income is associated with roughly a 10-point increase in SAT scores, which is statistically significant and modestly economically significant. The coefficient is significant at the 1% level, meaning there is strong evidence in support of using median income as a predictor of SAT scores. However, the model's R-squared of 0.072 indicates that variation in median income explains only a small share of the variation in SAT scores. So, while income is a statistically significant predictor of SAT scores, the model may not be the best fit.
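The magnitude implied by the table can be worked out directly: in a log-linear model, the effect of a 10% increase in income on the predicted score is the coefficient times ln(1.1), which is roughly the coefficient divided by ten.

```python
import numpy as np

beta = 106.174  # coefficient on ln(Median Income) from the table above

exact = beta * np.log(1.1)  # exact effect of a 10% income increase
approx = beta * 0.10        # common rule-of-thumb approximation
print(f'Exact: {exact:.1f} points, approximate: {approx:.1f} points')
# -> Exact: 10.1 points, approximate: 10.6 points
```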
import statsmodels.api as sm
import numpy as np
population_df['ln(Population)'] = np.log(population_df['Population'])
# Add constant to exogenous variables
X = sm.add_constant(population_df['ln(Population)'])
reg1 = sm.OLS(endog=population_df['Average Total SAT Score'], exog=X, missing='drop')
results4 = reg1.fit()
#Print the regression results
stargazer = Stargazer([results4])
stargazer.custom_columns(["Average Total SAT Scores"],[1])
HTML(stargazer.render_html())
#print(stargazer.render_latex())
| Dependent variable:Average Total SAT Score | |
| Average Total SAT Scores | |
| (1) | |
| const | 2154.656*** |
| (205.449) | |
| ln(Population) | -80.187*** |
| (18.714) | |
| Observations | 374 |
| R2 | 0.047 |
| Adjusted R2 | 0.044 |
| Residual Std. Error | 190.484 (df=372) |
| F Statistic | 18.360*** (df=1; 372) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
Similar to the previous model, we can apply the same analysis here. When examining the relationship between the population of a NYC zip code and SAT scores, we find that a 10% increase in population is associated with roughly an 8-point decrease in SAT scores. This result is statistically significant at the 1% level. However, like the previous model, this one also has a very low R-squared, indicating that variation in population explains only a small share of the variation in SAT scores. As a result, this model may also not be the best fit.
# Create a list of exogenous variables for Model 2
population_df['ln(Population)'] = np.log(population_df['Population'])
# Add constant to exogenous variables
X = sm.add_constant(population_df['ln(Population)'])
# Take the natural logarithm of Median Income
sat_income['ln(Median Income)'] = np.log(sat_income['Median Income'])
# Add a constant
X2 = sm.add_constant(sat_income['ln(Median Income)'])
# Concatenate the two sets of exogenous variables
X3 = pd.concat([X, X2.iloc[:,1]], axis=1)
# Estimate an OLS regression for each set of variables
reg1 = sm.OLS(sat_income['Average Total SAT Score'], X, missing='drop').fit()
reg2 = sm.OLS(sat_income['Average Total SAT Score'], X2, missing='drop').fit()
reg3 = sm.OLS(sat_income['Average Total SAT Score'], X3, missing='drop').fit()
from statsmodels.iolib.summary2 import summary_col
info_dict = {'R-squared': lambda x: f"{x.rsquared:.2f}",
'No. observations': lambda x: f"{int(x.nobs):d}"}
results_table = summary_col(results=[reg1, reg2, reg3],
float_format='%0.2f',
stars=True,
model_names=['Model 1', 'Model 2', 'Model 3'],
info_dict=info_dict,
regressor_order=['const',
'ln(Population)',
'ln(Median Income)'])
#Print the regression results
stargazer = Stargazer([reg1, reg2, reg3])
stargazer.custom_columns(["Model 1", "Model 2", "Model 3"], [1, 1, 1])
HTML(stargazer.render_html())
| Dependent variable:Average Total SAT Score | |||
| Model 1 | Model 2 | Model 3 | |
| (1) | (2) | (3) | |
| const | 2154.656*** | 96.605 | 796.747* |
| (205.449) | (219.492) | (414.471) | |
| ln(Median Income) | 106.174*** | 84.436*** | |
| (19.751) | (22.507) | ||
| ln(Population) | -80.187*** | -41.841** | |
| (18.714) | (21.043) | ||
| Observations | 374 | 374 | 374 |
| R2 | 0.047 | 0.072 | 0.082 |
| Adjusted R2 | 0.044 | 0.070 | 0.077 |
| Residual Std. Error | 190.484 (df=372) | 187.964 (df=372) | 187.222 (df=371) |
| F Statistic | 18.360*** (df=1; 372) | 28.897*** (df=1; 372) | 16.540*** (df=2; 371) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | ||
Model 3: $$ \widehat{satscores}_i = 796.747 + 84.436 \ \ln(income_i) - 41.841 \ \ln(population_i) $$
In this regression, we start with two previously seen models and add a third model that combines population and median income. Although the R squared increases slightly, it remains quite low. Overall, we gather that the variation in our independent variables does not account for much variation in our dependent variable. However, our modeling is still useful for understanding the relationships between our variables. It confirms what we have previously observed in our visualizations: increases in income and decreases in population are associated with higher SAT scores. We cannot infer causation, but a relationship does exist. Furthermore, in line with the previous studies we mentioned earlier, there are theories that support the idea that areas with more resources and personal specialization for students tend to have better outcomes.
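As a sanity check on Model 3, we can plug illustrative values into the fitted equation; the income and population figures below are hypothetical, not drawn from our data.

```python
import numpy as np

# Model 3 coefficients from the table above
const, b_income, b_pop = 796.747, 84.436, -41.841

# Hypothetical zip code: median income of $55,000, population of 40,000
pred = const + b_income * np.log(55_000) + b_pop * np.log(40_000)
print(f'Predicted average total SAT score: {pred:.0f}')
# -> Predicted average total SAT score: 1275
```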
import pandas as pd
# Create dummies for the 'borough' column
dummies = pd.get_dummies(sat_income['Borough'])
# Concatenate the original data with the dummies
dummy_borough = pd.concat([sat_income, dummies], axis=1)
# Drop the original 'borough' column
dummy_borough.drop('Borough', axis=1, inplace=True)
dummy_borough.head()
| School ID | School Name | Building Code | Street Address | City | State | Zip Code | Latitude | Longitude | Phone Number | ... | Average Score (SAT Writing) | Percent Tested | Average Total SAT Score | Median Income | ln(Median Income) | Bronx | Brooklyn | Manhattan | Queens | Staten Island | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01M539 | New Explorations into Science, Technology and ... | M022 | 111 Columbia Street | Manhattan | NY | 10002 | 40.71873 | -73.97943 | 212-677-5190 | ... | 601.0 | 91.00% | 1859.0 | 37093.0 | 10.521184 | 0 | 0 | 1 | 0 | 0 |
| 1 | 02M294 | Essex Street Academy | M445 | 350 Grand Street | Manhattan | NY | 10002 | 40.71687 | -73.98953 | 212-475-4773 | ... | 387.0 | 78.90% | 1193.0 | 37093.0 | 10.521184 | 0 | 0 | 1 | 0 | 0 |
| 2 | 02M308 | Lower Manhattan Arts Academy | M445 | 350 Grand Street | Manhattan | NY | 10002 | 40.71687 | -73.98953 | 212-505-0143 | ... | 415.0 | 65.10% | 1261.0 | 37093.0 | 10.521184 | 0 | 0 | 1 | 0 | 0 |
| 3 | 02M545 | High School for Dual Language and Asian Studies | M445 | 350 Grand Street | Manhattan | NY | 10002 | 40.71687 | -73.98953 | 212-475-4097 | ... | 463.0 | 95.90% | 1529.0 | 37093.0 | 10.521184 | 0 | 0 | 1 | 0 | 0 |
| 4 | 01M292 | Henry Street School for International Studies | M056 | 220 Henry Street | Manhattan | NY | 10002 | 40.71376 | -73.98526 | 212-406-9411 | ... | 381.0 | 59.70% | 1197.0 | 37093.0 | 10.521184 | 0 | 0 | 1 | 0 | 0 |
5 rows × 29 columns
from statsmodels.api import add_constant
from statsmodels.regression.linear_model import OLS
from IPython.display import HTML
from stargazer.stargazer import Stargazer
# Select only the borough columns from the dummy_borough dataframe
borough_cols = ['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island']
borough_data = dummy_borough[borough_cols].copy()
# Set Manhattan as the reference category
borough_data.drop('Manhattan', axis=1, inplace=True)
# Concatenate the original data with the dummies
borough_sat = pd.concat([dummy_borough['Average Total SAT Score'], borough_data], axis=1)
# Create the regression model with a constant
X = add_constant(borough_sat.drop('Average Total SAT Score', axis=1))
y = borough_sat['Average Total SAT Score']
reg = OLS(endog=y, exog=X, missing='drop')
results2 = reg.fit()
# Print the regression results
stargazer = Stargazer([results2])
stargazer.custom_columns(["Average Total SAT Scores"],[1])
HTML(stargazer.render_html())
| Dependent variable:Average Total SAT Score | |
| Average Total SAT Scores | |
| (1) | |
| Bronx | -137.410*** |
| (26.916) | |
| Brooklyn | -109.878*** |
| (26.262) | |
| Queens | 3.292 |
| (29.607) | |
| Staten Island | 98.865 |
| (61.309) | |
| const | 1340.135*** |
| (19.485) | |
| Observations | 374 |
| R2 | 0.120 |
| Adjusted R2 | 0.110 |
| Residual Std. Error | 183.823 (df=369) |
| F Statistic | 12.541*** (df=4; 369) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
This model is quite interesting. Our dependent variable is still SAT scores, with dummy variables for Bronx, Brooklyn, Queens, and Staten Island as our independent variables. Manhattan is omitted and serves as our reference category. The intercept tells us that this reference category has an average SAT score of about 1340 points. Bronx and Brooklyn have negative coefficients, meaning that, on average, students there score between 110 and 140 points lower than students in Manhattan. Queens has very similar scores to Manhattan, while Staten Island's scores are about 100 points higher, though neither of these last two coefficients is statistically significant.
Overall, the table suggests that there are significant differences in SAT scores across different boroughs of New York City. However, the model is still limited in its explanatory power, as the included variables account for only a small portion of the variation in SAT scores, as seen by the low R squared.
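The encoding behind this regression can be sketched in a few lines: one borough is dropped so the remaining dummies are not collinear with the constant, and each coefficient is then read as a difference from the dropped reference category (the toy values below are illustrative).

```python
import pandas as pd

boroughs = pd.Series(['Manhattan', 'Bronx', 'Brooklyn', 'Manhattan', 'Queens'],
                     name='Borough')
# Drop Manhattan so it serves as the reference category
dummies = pd.get_dummies(boroughs).drop(columns='Manhattan')
print(dummies)
```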
# Select the relevant columns
race_cols = ['Percent Asian', 'Percent White', 'Percent Black', 'Percent Hispanic']
race_data = sat_income[race_cols]
# Concatenate the original data with the race data
data = pd.concat([sat_income['Average Total SAT Score'], race_data], axis=1)
# Create the regression model
X = data.drop('Average Total SAT Score', axis=1)
y = data['Average Total SAT Score']
reg = sm.OLS(endog=y, exog=X, missing='drop')
results3 = reg.fit()
# Print the regression results
stargazer = Stargazer([results3])
stargazer.custom_columns(["Average Total SAT Scores"],[1])
HTML(stargazer.render_html())
| Dependent variable:Average Total SAT Score | |
| Average Total SAT Scores | |
| (1) | |
| Percent Asian | 17.972*** |
| (0.489) | |
| Percent Black | 12.237*** |
| (0.193) | |
| Percent Hispanic | 11.226*** |
| (0.174) | |
| Percent White | 18.902*** |
| (0.539) | |
| Observations | 374 |
| R2 | 0.989 |
| Adjusted R2 | 0.989 |
| Residual Std. Error | 135.019 (df=370) |
| F Statistic | 8443.869*** (df=4; 370) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
Here, we use the racial composition of NYC high schools as our independent variables. The high R-squared value of 0.989 suggests that these variables track SAT scores very closely. All coefficients are positive, but some imply higher scores than others: an increase of 1 percentage point in the proportion of Asian or White students is associated with roughly an 18-point higher average score, while for Hispanic and Black students the corresponding figure is smaller, at about 11 to 12 points.
Our model appears to fit well in this case, but once again we cannot infer causation. We can, however, observe that a clear relationship exists in favor of students from Asian or White backgrounds.
race_cols = ['Percent Asian', 'Percent White', 'Percent Black', 'Percent Hispanic', 'ln(Median Income)']
race_data = sat_income[race_cols]
data = pd.concat([sat_income['Average Total SAT Score'], race_data], axis=1)
X = data.drop('Average Total SAT Score', axis=1)
y = data['Average Total SAT Score']
reg2 = sm.OLS(endog=y, exog=X, missing='drop')
results4 = reg2.fit()
stargazer = Stargazer([results3, results4])
stargazer.custom_columns(["Model 1", "Model 2"], [1, 1])
HTML(stargazer.render_html())
| Dependent variable:Average Total SAT Score | ||
| Model 1 | Model 2 | |
| (1) | (2) | |
| Percent Asian | 17.972*** | 11.844*** |
| (0.489) | (1.627) | |
| Percent Black | 12.237*** | 6.278*** |
| (0.193) | (1.524) | |
| Percent Hispanic | 11.226*** | 5.547*** |
| (0.174) | (1.451) | |
| Percent White | 18.902*** | 12.455*** |
| (0.539) | (1.719) | |
| ln(Median Income) | 52.183*** | |
| (13.244) | ||
| Observations | 374 | 374 |
| R2 | 0.989 | 0.990 |
| Adjusted R2 | 0.989 | 0.989 |
| Residual Std. Error | 135.019 (df=370) | 132.445 (df=369) |
| F Statistic | 8443.869*** (df=4; 370) | 7023.394*** (df=5; 369) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | |
Model 2: $$ \widehat{satscores}_i = 11.844 \ {Asian}_i + 6.278 \ {Black}_i + 5.547 \ {Hispanic}_i + 12.455 \ {White}_i + 52.183 \ \ln(income_i) $$
This model serves as a continuation of the previous one, with the addition of the log of median income as an independent variable. The effect on the results is not too drastic, as the R-squared value remains similar. When controlling for income, an increase of one percentage point in the proportion of Asian or White students is associated with a 12-point increase in SAT scores. For Hispanic and Black students, the increase is around 6 points, which is half the magnitude.
When holding racial demographics constant, a 10% increase in income is associated with a 5.2-point increase in the Average Total SAT scores. These values are statistically and moderately economically significant, indicating that both racial backgrounds and income play a role in determining SAT scores.
total = sat_income[['ln(Median Income)','Average Total SAT Score']].copy()
# Note: joining on the outcome variable can pair unrelated schools whenever two
# schools share the same average score; merging on a school identifier would be safer
mergedclare = pd.merge(total, borough_sat, on='Average Total SAT Score')
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from IPython.core.display import HTML
from stargazer.stargazer import Stargazer
# Create a dataframe with borough information only
exog_cols_borough = ['Bronx', 'Brooklyn', 'Queens', 'Staten Island']
exog_vars_borough = mergedclare[exog_cols_borough]
X1 = add_constant(exog_vars_borough)
y = mergedclare['Average Total SAT Score']
reg1 = sm.OLS(endog=y, exog=X1, missing='drop')
results1 = reg1.fit()
# Take the natural logarithm of Median Income
sat_income['ln(Median Income)'] = np.log(sat_income['Median Income'])
# Add borough information and income to the dataframe
exog_cols_all = ['Bronx', 'Brooklyn', 'Queens', 'Staten Island', 'ln(Median Income)']
exog_vars_all = mergedclare[exog_cols_all]
X2 = add_constant(exog_vars_all)
reg2 = sm.OLS(endog=y, exog=X2, missing='drop')
results2 = reg2.fit()
# Combine the results of both models into a single table
stargazer = Stargazer([results1, results2])
stargazer.custom_columns(["Model 1", "Model 2"], [1, 1])
HTML(stargazer.render_html())
| Dependent variable:Average Total SAT Score | ||
| Model 1 | Model 2 | |
| (1) | (2) | |
| Bronx | -89.334*** | -67.503*** |
| (16.327) | (16.829) | |
| Brooklyn | -77.167*** | -69.626*** |
| (15.692) | (15.575) | |
| Queens | 14.975 | 19.779 |
| (19.203) | (18.979) | |
| Staten Island | 141.594*** | 135.210*** |
| (41.868) | (41.339) | |
| const | 1291.006*** | 679.352*** |
| (11.858) | (136.753) | |
| ln(Median Income) | 54.289*** | |
| (12.094) | ||
| Observations | 714 | 714 |
| R2 | 0.098 | 0.123 |
| Adjusted R2 | 0.093 | 0.117 |
| Residual Std. Error | 155.516 (df=709) | 153.457 (df=708) |
| F Statistic | 19.275*** (df=4; 709) | 19.867*** (df=5; 708) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | |
We will now examine a regression model that adds median income as an independent variable to the borough regression discussed previously. The R-squared rises slightly, from 0.098 to 0.123, but remains relatively low, indicating that borough and income data explain only a modest share of the variation in SAT scores.
When the analysis controls for different boroughs, a 10% increase in income is linked to a 5.4-point increase in SAT scores. Furthermore, when controlling for income, a student from the Bronx or Brooklyn scores about 68 points lower on average than a student from Manhattan. On the other hand, students from Queens have scores about 20 points higher, while students from Staten Island have scores about 135 points higher. All of these findings are statistically significant, except for Queens, indicating that the borough where a student lives and their income background appear to influence their SAT scores. Moreover, these results are highly economically significant, with over a 200-point discrepancy in scores between Brooklyn and Staten Island.
totaltotal = pd.merge(mergedclare, sat_income, on=['ln(Median Income)', 'Average Total SAT Score'])
ln_pop = population_df[['ln(Population)', 'Average Total SAT Score']].drop_duplicates(subset='Average Total SAT Score')
totaltotal2 = pd.merge(totaltotal, ln_pop, on='Average Total SAT Score')
# Select columns for exogenous variables
exog_cols = ['Bronx', 'Brooklyn', 'Queens', 'Staten Island', 'Percent Asian', 'Percent White', 'Percent Black', 'Percent Hispanic', 'ln(Median Income)', 'ln(Population)']
exog_vars = totaltotal2[exog_cols]
# Create the regression model
X3 = exog_vars
y = totaltotal2['Average Total SAT Score']
reg3 = sm.OLS(endog=y, exog=X3, missing='drop')
results5 = reg3.fit()
# Print the regression results
stargazer = Stargazer([results0, results2, results5])
stargazer.custom_columns(["Model 1", "Model 2", "Model 3"],[1, 1, 1])
HTML(stargazer.render_html())
| Dependent variable:Average Total SAT Score | |||
| Model 1 | Model 2 | Model 3 | |
| (1) | (2) | (3) | |
| Bronx | -67.503*** | -27.734** | |
| (16.829) | (12.521) | ||
| Brooklyn | -69.626*** | -56.811*** | |
| (15.575) | (11.921) | ||
| Percent Asian | 11.585*** | ||
| (1.338) | |||
| Percent Black | 7.070*** | ||
| (1.286) | |||
| Percent Hispanic | 6.371*** | ||
| (1.244) | |||
| Percent White | 13.489*** | ||
| (1.416) | |||
| Queens | 19.779 | -31.204** | |
| (18.979) | (14.457) | ||
| Staten Island | 135.210*** | -64.835** | |
| (41.339) | (30.321) | ||
| const | 0.076 | 679.352*** | |
| (0.172) | (136.753) | ||
| ln(Median Income) | 106.174*** | 54.289*** | 41.652*** |
| (19.751) | (12.094) | (8.292) | |
| ln(Population) | 6.080 | ||
| (6.328) | |||
| Observations | 374 | 714 | 740 |
| R2 | 0.072 | 0.123 | 0.992 |
| Adjusted R2 | 0.070 | 0.117 | 0.992 |
| Residual Std. Error | 187.964 (df=372) | 153.457 (df=708) | 113.060 (df=730) |
| F Statistic | 28.897*** (df=1; 372) | 19.867*** (df=5; 708) | 9047.790*** (df=10; 730) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | ||
In this regression, we include all of our previously discussed independent variables together, focusing mainly on Model 3. In this case, the R-squared is extremely high, meaning 99.2% of the variation in SAT scores is explained by our independent variables (note, though, that Model 3 omits a constant, which inflates the reported R-squared).
First, for the borough variables, holding all else constant, being in the Bronx or Brooklyn shows a negative relationship: scores there are, on average, about 28 and 57 points lower than scores in Manhattan, respectively, with Brooklyn significant at the 1% level and the Bronx at the 5% level. Queens and Staten Island had positive coefficients in Model 2, but in Model 3 these coefficients flip sign, turning negative and significant only at the 5% level. This suggests that the apparent advantage of these boroughs is absorbed by the demographic, income, and population variables in the full model.
Regarding racial demographics, when controlling for income, location, and population, there is still a positive relationship, and it is stronger for students from White and Asian backgrounds. For example, a one percentage-point increase in the share of Asian students in a school is associated with about an 11.6-point increase in SAT scores.
Median income shows similar effects to what we had seen before. When controlling for all else, a 10% increase in income is associated with about a 4 point increase in SAT scores.
However, the coefficient for population is not statistically significant. Therefore, when considering all the other variables, population is not a significant indicator in our model.
Our preferred specification is the regression with racial demographics and income as independent variables, because it includes all four racial demographics (Asian, Black, Hispanic, and White), which allows us to examine the relationship between each demographic and average SAT scores. A very strong relationship emerged from our analysis, and the R-squared of 0.989 tells us that racial demographics track almost all of the variation in SAT scores. This does not imply causation, and even when including median income as a control, the R-squared remained high at 0.990. Also, all the coefficient estimates are statistically significant at the 1% level. This could indicate that each racial demographic has a significant positive association with SAT scores.
To evaluate the regressions, we can use several measures shown in our tables: R-squared, the F-statistic, the t-statistics, and the residual standard error.
R-squared measures the proportion of the variation in the dependent variable that is explained by the independent variables. In the context of the SAT score regressions, it indicates the percentage of the variation in SAT scores that can be explained by the included predictors; our models produced both very high and very low values.
The F-statistic measures the overall significance of the model: a high F-statistic with a low p-value indicates that the model as a whole is statistically significant.
The t-statistics play the analogous role for individual coefficients: a high t-value with a low p-value indicates that an independent variable is statistically significant.
The residual standard error measures how closely the data points sit around the fitted regression line, and thus the overall fit of the model; a smaller residual standard error means a better fit.
Overall, while our results are statistically significant and demonstrate important relationships between our variables (such as income, population, borough, and racial demographics) and SAT scores, several of our models account for only a small share of the variation in the dependent variable. While we cannot infer causation from these models, it is clear that the racial composition of a school and the borough a student comes from are key correlates of SAT scores.
Our objective function aims to minimize the mean squared error between SAT scores and two selected X variables: racial demographics and income. These variables were chosen based on our regression results, where models with racial demographics as independent variables showed higher R-squared values and lower standard errors. Additionally, incorporating the income variable in different models improved their accuracy. Therefore, these two variables seem like the right fit for a regression tree.
To regularize our model and limit overfitting, we can tune several parameters. The first is the maximum tree depth, which limits the number of levels in the tree. A shallower tree can help prevent overfitting and excessive complexity. In our regression, we choose a depth of 3 levels, as 2 may not capture enough information and 4 can result in an overcrowded and illegible tree.
The second parameter we implement is the minimum sample leaf size, which specifies the minimum number of samples required for a node to be considered a leaf. A larger sample size can result in a simpler tree with more generalized data, while a smaller sample size may lead to a more complex tree. In our case, we choose a minimum sample leaf size of 5.
Lastly, we also set a cost-complexity pruning parameter, alpha (ccp_alpha = 1 in our code). A higher alpha indicates stronger regularization, resulting in more aggressive pruning of branches that reduce the error only marginally. If we were to decrease the alpha, our tree would have more branches and leaves.
no_ln = pd.merge(sat_income, population_df)
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
y = totaltotal2["Average Total SAT Score"]
X = totaltotal2.loc[:, ["Percent White", "Percent Black", "Percent Asian", "Percent Hispanic", "Median Income"]]
lr_model = linear_model.LinearRegression()
lr_model.fit(X, y)
print(lr_model.intercept_)
print(lr_model.coef_)
2986.8340955560307 [-1.22664489e+01 -1.85620701e+01 -1.41184017e+01 -1.88230932e+01 1.85622517e-04]
y_pred_linear = lr_model.predict(X)
from sklearn import metrics
full_mse = metrics.mean_squared_error(y, y_pred_linear)
print('Mean Squared Error:', full_mse)
Mean Squared Error: 11443.494401730657
from sklearn import tree
max_depth = 3          # Maximum tree depth
min_samples_leaf = 5   # Minimum number of samples per leaf
ccp_alpha = 1          # Cost-complexity pruning parameter (tiny relative to our MSE scale)
sat_tree = tree.DecisionTreeRegressor(max_depth=max_depth,
                                      min_samples_leaf=min_samples_leaf,
                                      ccp_alpha=ccp_alpha)
sat_tree.fit(X, y)
# use the fitted tree to predict
y_pred_tree = sat_tree.predict(X)
# find the error of prediction (MSE)
from sklearn import metrics
print('Mean Squared Error:', metrics.mean_squared_error(y, y_pred_tree))
Mean Squared Error: 8900.016342481884
fig = plt.figure(figsize=(25, 20))
tree.plot_tree(sat_tree, feature_names=list(X.columns), filled=True)
plt.show()
The first node of the tree splits on the independent variable "Percent White." For a sample of 380 schools, splitting on whether the proportion of White students in a school is less than or equal to 20.25%, the predicted score for schools satisfying this condition is about 1244 points.
Moving to the second level of the tree, if a school's enrollment is less than 20.25% White and less than 8.5% Asian, the predicted score is around 1225 points.
For the final nodes on one branch of the tree, with a smaller sample of 254 schools, when a school has less than 2.45% Asian students, the predicted scores tend to be around 1193 points.
In the other branch from the root node, when schools have less than 22% Hispanic students, meaning they are largely composed of other racial groups, the predicted scores tend to be much higher, around 1590 points.
When income enters the tree, if the median income of the school's neighborhood is less than $60,000, the predicted scores trend towards 1287 points; if it is higher than $60,000, they trend towards 1470 points, indicating a significant economic difference.
Squared error varies across different nodes in the regression tree, indicating varying levels of fit between the model and the data. For the first node, the squared error is much lower compared to when it splits off on the right side, where errors reach 68,254 squared points.
To put the errors in perspective, let's take the square root of some of these values. For the first node, the squared error is approximately 26,037, whose square root is about 161 points. This means the typical difference between predicted and actual scores at the root is around 161 points, a moderate deviation that suggests our model fits the data reasonably well.
When looking at the left-hand side of the tree, the squared error ranges between 3796 and 28,176, which, when square rooted, corresponds to a range of 61 points to 167 points. Again, these differences are not substantial.
On the right-hand side of the tree, errors range between 118 points and 261 points. While this range is higher than the left-hand side of the tree, it is still relatively low.
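The conversions above follow from square-rooting each node's mean squared error; the node values below are the ones read off our tree:

```python
import numpy as np

# Node squared errors read off the regression tree above
node_mse = {'root': 26037, 'left min': 3796, 'left max': 28176, 'right max': 68254}
rmse = {node: float(np.sqrt(mse)) for node, mse in node_mse.items()}
print({node: round(v, 1) for node, v in rmse.items()})
```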
Overall, our model performs well in predicting SAT scores based on the given independent variables.
Compared with our regression results, the regression tree offers a more visual representation of our findings. While the regression table provides more detail, such as t-statistics and p-values, the tree summarizes the same information concisely.
Furthermore, the tree can capture non-linear relationships among variables, which is difficult to do directly in a regression table. For instance, we found that taking the logarithm of income and population produced more significant results (and instrumental variables could also have been used), whereas the tree captures such non-linearities automatically.
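As a minimal illustration of why the log transform helps, using hypothetical income values: incomes that grow multiplicatively become evenly spaced after taking logs, which a linear model can then fit more naturally.

```python
import math

# Hypothetical neighborhood incomes, each triple the previous one.
incomes = [20_000, 60_000, 180_000]
logs = [math.log(x) for x in incomes]

# Raw gaps differ (40,000 vs 120,000), but the log gaps are identical
# (both equal log 3), so the relationship becomes linear in logs.
raw_gaps = [incomes[1] - incomes[0], incomes[2] - incomes[1]]
log_gaps = [logs[1] - logs[0], logs[2] - logs[1]]
```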
The tree allows us to directly visualize how schools with lower percentages of White and Asian students tend to have lower scores, compared to schools with lower percentages of Hispanic students, which tend to have higher scores. Additionally, the relationship between income and SAT scores is also captured.
Regression trees can also capture interaction effects, such as the effect of racial demographics on SAT scores varying with income level, effects that a regression table would miss unless interaction terms are specified explicitly.
The predictive aspect of the tree is particularly interesting. Starting with large samples of schools, the tree progressively narrows down the samples at each level, leading to more specific inferences, while taking into account the measure of error.
Let us consider the relative strength of these effects. In our regression table titled "SAT Scores, Racial Demographics, and Median Income," we observe a different coefficient estimate for each racial category, but the table alone cannot tell us whether some effects dominate others. Are the effects of each racial group equally significant for SAT scores? Higher shares of Black and Hispanic students are associated with smaller score increases than higher shares of White and Asian students, but which group's share matters most?
The tree offers an answer. "Percent Black" does not appear at any node, suggesting its effect on SAT scores is weaker than that of the other variables. By contrast, "Percent Asian" and "Percent Hispanic" are used multiple times for splitting, indicating they are important for partitioning the data and capturing distinct patterns. "Percent White" is likewise an important predictor: it is the first variable used to partition the data and may have the greatest impact on the model's overall performance.
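This split-counting reasoning can be made explicit with a small tally. The list of split variables below is illustrative, mirroring the pattern described above rather than an exact readout of the fitted tree.

```python
from collections import Counter

# Hypothetical sequence of split variables, in tree order, mirroring
# the text: Percent White splits first, Percent Asian and Percent
# Hispanic recur, and Percent Black never appears.
split_variables = [
    "Percent White",      # first (root) split
    "Percent Asian",
    "Percent Hispanic",
    "Percent Asian",
    "Median Income",
]
usage = Counter(split_variables)
print(usage.most_common())
```

A variable that never appears in the tally, like "Percent Black" here, was never useful enough to partition the data at any node.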
Our study analyzed the relationship between SAT scores and school demographics in New York City public schools. We focused mainly on the borough data, racial demographics, population data, and median income data as key variables. Our findings provide insights into the factors that may influence standardized testing scores, such as SAT scores, and shed light on potential ways to improve educational outcomes.
Our analysis revealed that average SAT scores were centered around 400 points per subject, with some schools performing considerably better than others. We also found a relationship between a school's racial composition and its students' SAT performance, with White students tending to score higher than students from minority backgrounds. The correlation heatmap confirmed that racial background played a significant role: Percent White and Percent Asian were positively correlated with SAT scores, while Percent Hispanic and Percent Black were negatively correlated with them.
Additionally, our examination of different maps led us to conclude that income and location were also significant factors in students' performance on the SAT. The scatter plot showed a weak positive relationship between median income and average SAT scores, indicating that income is a weak predictor of SAT scores in this dataset. However, our map analysis demonstrated that schools located in poorer boroughs tend to perform worse on the SAT than those in more affluent areas, suggesting that location may be a more significant factor in students' performance.
We build on existing research examining the relationship between educational achievement and various factors, referring several times to Rocco d'Este and Elias Einiö's paper on Asian segregation and scholastic outcomes. Our paper focuses more broadly on effects on high school students and the full range of factors that could influence their performance. We support d'Este and Einiö's findings through the positive correlations between racial composition and SAT scores: schools with a higher proportion of Asian and White students generally score higher than schools with mainly Hispanic and Black students.
Additionally, our analysis suggests relationships between SAT scores and both median income and population, with higher incomes and smaller populations associated with higher scores. This aligns with previous research showing correlations between income, resources, and outcomes. We acknowledge, however, that potential non-linearities and endogeneity issues could complicate our analysis, and further research could explore these relationships in more depth.
In future research, it would be interesting to investigate the variables behind our independent variables: what leads to different income levels, and why do boroughs differ in infrastructure and development? Other potential factors such as school resources, class sizes, and teaching quality could provide a more comprehensive understanding of the mechanisms driving our findings. Qualitative methods such as interviews and focus groups, used alongside observational data, could also provide deeper insights and perspectives.
In conclusion, our study contributes to the growing body of literature on the relationship between SAT scores and school demographics, providing important insights into the factors that may influence educational outcomes in New York City public schools. Our findings suggest that addressing disparities in racial demographics and income levels may be key in creating a more equitable educational system. Further research is needed to better understand the underlying mechanisms driving these relationships and to inform evidence-based policies and interventions aimed at improving educational opportunities for all students.
United States Census Bureau. (n.d.). Census.gov. Retrieved from https://www.census.gov/
Average SAT Scores for NYC Public Schools. (n.d.). Kaggle. Retrieved from https://www.kaggle.com/datasets/nycopendata/high-schools
Top 10 Most Dangerous Neighborhoods in New York City. (n.d.). USA ESTA Online. Retrieved from https://usaestaonline.com/most-dangerous-neighborhoods-in-new-york-city
Zip Codes in New York. (n.d.). New York Demographics. Retrieved from https://www.newyorkdemographics.com/zip_codes_by_population
The City. (n.d.). Poverty Rates by NYC Community District: 2015-2019. Retrieved from https://www.thecity.nyc/data/poverty-rates-by-nyc-community-district-2015-2019-09414/
d'Este, R., & Einiö, E. (Year). Asian Segregation and Scholastic Achievement: Evidence from Primary Schools in New York City. IZA Discussion Paper No. 11682. Retrieved from https://docs.iza.org/dp11682.pdf
Abdulkadiroglu, A., Hu, W., & Pathak, P. (Year). Small High Schools and Student Achievement: Lottery-Based Evidence from New York City. Journal of Applied Econometrics, 31(1), 113-137.
Graham, A. E., & Husted, T. A. (Year). Understanding state variations in SAT scores. Journal of Education Finance, 45(2), 241-259.